Abstract: We examine the problem of allocating a given total
storage budget in a distributed storage system for maximum 
reliability. A source has a single data object that is to be coded 
and stored over a set of storage nodes; it is allowed to store 
any amount of coded data in each node, as long as the total 
amount of storage used does not exceed the given budget. A data 
collector subsequently attempts to recover the original data object 
by accessing only the data stored in a random subset of the nodes. 
By using an appropriate code, successful recovery can be achieved 
whenever the total amount of data accessed is at least the size of 
the original data object. The goal is to find an optimal storage 
allocation that maximizes the probability of successful recovery. 
This optimization problem is challenging in general because of its 
combinatorial nature, despite its simple formulation. We study 
several variations of the problem, assuming different allocation 
models and access models. The optimal allocation and the optimal 
symmetric allocation (in which all nonempty nodes store the same 
amount of data) are determined for a variety of cases. Our results 
indicate that the optimal allocations often have nonintuitive 
structure and are difficult to specify. We also show that depending 
on the circumstances, coding may or may not be beneficial for 
reliable storage. 

Index Terms — Data storage systems, distributed storage, 
network coding, reliability, storage allocation. 

I. Introduction 

CONSIDER a distributed storage system comprising n 
storage nodes. A source has a single data object of 
normalized unit size that is to be coded and stored in a 
distributed manner over these nodes, subject to a given total 
storage budget T. Let $x_i$ be the amount of coded data stored
in node $i \in \{1, \ldots, n\}$. Any amount of data may be stored in
each node, as long as the total amount of storage used over
all nodes is at most the given budget T, i.e.,

$$\sum_{i=1}^{n} x_i \le T.$$

Fig. 1. Information flows in a distributed storage system. The source s has
a single data object of normalized unit size that is to be coded and stored
over n storage nodes. Subsequently, a data collector t attempts to recover the
original data object by accessing only the data stored in a random subset $\mathbf{r}$
of the nodes.

This is a realistic constraint if there is limited transmission 
bandwidth or storage space, or if it is too costly to mirror 
the data object in its entirety in every node. At some time 
after the creation of this coded storage, a data collector 
attempts to recover the original data object by accessing only 
the data stored in a random subset $\mathbf{r}$ of the nodes, where
the probability distribution of $\mathbf{r} \subseteq \{1, \ldots, n\}$ is specified by
an assumed access model or failure model (nodes or links
may fail probabilistically, for example). Fig. 1 depicts such a
distributed storage system.

The reliability of this system, which we define to be the 
probability of successful recovery (or recovery probability in 
short), depends on both the storage allocation and the coding 
scheme. For maximum reliability, we would therefore need to 
find 

(i) an optimal allocation of the given budget T over the 
nodes, specified by the values of $x_1, \ldots, x_n$, and

(ii) an optimal coding scheme 

that jointly maximize the probability of successful recovery. It 
turns out that these two problems can be decoupled by using a 
good coding scheme, specifically one that enables successful 
recovery whenever the total amount of data accessed by 
the data collector is at least the size of the original data 
object. This can be seen by considering the information flows 
for a network in which the source is multicasting the data 
object to a set of potential data collectors Q, |6|: successful 
recovery can be achieved by a data collector if and only if 
its corresponding max-flow or min-cut from the source is 
at least the size of the original data object. Random linear 
coding over a sufficiently large field would allow successful 
recovery with high probability when this condition is satisfied 
13, IU. Alternatively, a suitable maximum distance separable 
(MDS) code for the given budget and data object size would 
allow successful recovery with certainty when this condition 
is satisfied. 

Therefore, assuming the use of an appropriate code, 
the probability of successful recovery for an allocation 
$(x_1, \ldots, x_n)$ can be written as

$$P[\text{successful recovery}] = P\left[\sum_{i \in \mathbf{r}} x_i \ge 1\right].$$

Our goal is to find an optimal allocation that maximizes this 
recovery probability, subject to the given budget constraint. 

Although we have assumed coded storage at the outset, 
coding may ultimately be unnecessary for certain allocations. 
For example, if the budget is spread minimally such that 
each nonempty node stores the data object in its entirety (i.e., 
$x_i \ge 1$ for all $i \in S$, and $x_i = 0$ for all $i \notin S$, where S is
some subset of $\{1, \ldots, n\}$), then uncoded replication would
suffice since the data object can be recovered by accessing 
any one nonempty node; the data collector would not need to 
combine data accessed from different nodes in order to recover 
the data object. Thus, by solving for the optimal allocation, 
we will also be able to determine whether coding is beneficial 
for reliable storage. 

We note that even though no explicit upper bound is imposed
on the amount of data that can be stored in each node, it
is never necessary to set $x_i > 1$ because $x_i = 1$ already allows
the data object to be stored in its entirety in that node. The
absence of a tighter per-node storage constraint $x_i \le c_i < 1$
is reasonable for storage systems that handle a large number 
of data objects: we would expect the storage capacity of each 
node to be much larger than the size of a single data object, 
making it possible for a node to accommodate some of the 
data objects in their entirety. As such, it would be appropriate 
to apply a storage constraint for each data object via the budget 
T, without a separate a priori constraint for $x_i$. Furthermore,
the simplifying assumption of $x_i$ being a continuous variable is
a reasonable one for large data objects: a large data object size 
would facilitate the creation of coded data packets with sizes 
(closely) matching that of a desired allocation. Incidentally, 
the overhead associated with random linear coding or an MDS 
code, which is ignored in our model, becomes proportionately 
negligible when the amount of coded data is large. 

In spite of the simple formulation, this optimization prob- 
lem poses significant challenges because of its combinatorial 
nature and the large space of feasible allocations. Different 
variations of this problem can be formulated by assuming 
different allocation models and access models; in this paper, 
we will examine three such variations that are motivated by 
practical storage problems in content delivery networks, delay 
tolerant networks, and wireless sensor networks. 

A. Independent Probabilistic Access to Each Node 

In the first problem formulation, we assume that the data 
collector accesses each of the n nodes independently with 
constant probability p; in other words, each node i appears 
in subset r independently with probability p. The resulting 
problem can be interpreted as that of maximizing the reliability 
of data storage in a system comprising n storage devices where 
each device fails independently with probability $1 - p$. It is
not hard to show that determining the recovery probability 
of a given allocation is computationally difficult (specifically, 
#P-hard). The intuitive approach of spreading the budget
maximally over all nodes, i.e., setting $x_i = \frac{T}{n}$ for all i, turns
out to be not necessarily optimal; in fact, the optimal allocation
may not even be symmetric (we say that an allocation is
symmetric when all nonzero $x_i$ are equal). The following
counterexample from [9] demonstrates that symmetric allocations
can be suboptimal: for $(n, p, T) = \left(5, \frac{2}{3}, \frac{7}{3}\right)$, the nonsymmetric
allocation

$$\left(\tfrac{2}{3}, \tfrac{2}{3}, \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right),$$

which achieves a recovery probability of 0.90535, performs
strictly better than any symmetric allocation; the maximum
recovery probability among symmetric allocations is 0.88889,
which is achieved by both

$$\left(\tfrac{7}{6}, \tfrac{7}{6}, 0, 0, 0\right) \quad \text{and} \quad \left(\tfrac{7}{12}, \tfrac{7}{12}, \tfrac{7}{12}, \tfrac{7}{12}, 0\right).$$

Evidently, the simple strategy of "spreading eggs evenly over
more baskets" may not always improve the reliability of an
allocation.
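
The recovery probability of a small allocation under independent access can be checked by direct enumeration. The following Python sketch is an illustration rather than part of the original analysis; the function name `recovery_probability` is hypothetical. The instance $(n, p, T) = (5, 2/3, 7/3)$ and the three allocations below are assumed values, chosen because they reproduce the quoted probabilities 0.90535 and 0.88889. Exact rational arithmetic is used so that the threshold comparison against 1 is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def recovery_probability(allocation, p):
    """Exact recovery probability when each node is accessed independently
    with probability p. Enumerates all 2^n subsets, so only small n is practical."""
    n = len(allocation)
    total = Fraction(0)
    for size in range(n + 1):
        # Every subset of this size has the same probability of being the accessed set.
        weight = p**size * (1 - p) ** (n - size)
        for subset in combinations(range(n), size):
            if sum(allocation[i] for i in subset) >= 1:
                total += weight
    return total


# Assumed instance: n = 5 nodes, p = 2/3, budget T = 7/3.
p = Fraction(2, 3)
nonsymmetric = [Fraction(2, 3)] * 2 + [Fraction(1, 3)] * 3
symmetric_two = [Fraction(7, 6)] * 2 + [Fraction(0)] * 3
symmetric_four = [Fraction(7, 12)] * 4 + [Fraction(0)]

for name, alloc in [("nonsymmetric", nonsymmetric),
                    ("symmetric, 2 nonempty", symmetric_two),
                    ("symmetric, 4 nonempty", symmetric_four)]:
    print(name, float(recovery_probability(alloc, p)))
```

The enumeration is exponential in n, which is in line with the hardness remark above; it is meant only for sanity checks on small instances.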

Our Contribution: We show that the intuitive symmetric 
allocation that spreads the budget maximally over all nodes 
is indeed asymptotically optimal in a regime of interest. 
Specifically, we derive an upper bound for the suboptimality 
of this allocation, and show that the performance gap vanishes 
asymptotically as the total number of storage nodes n grows, 
when $p > \frac{1}{T}$. This is a regime of interest because a high
recovery probability is possible when $p > \frac{1}{T} \iff pT > 1$: the
expected total amount of data accessed by the data collector
is given by

$$E\left[\sum_{i=1}^{n} x_i Y_i\right] = \sum_{i=1}^{n} x_i E[Y_i] = p \sum_{i=1}^{n} x_i \le pT, \qquad (1)$$

where the $Y_i$'s are independent Bernoulli(p) random variables.
Therefore, the data collector would be able to access a 
sufficient amount of data in expectation for successful recovery 
if pT > 1. 

We also show that the symmetric allocation that spreads the 
budget minimally is optimal when p is sufficiently small. In 
such an allocation, the data object is stored in its entirety in 
each nonempty node, making coding unnecessary. Addition- 
ally, we explicitly find the optimal symmetric allocation for a 
wide range of parameter values of p and T. 

Related Work: This problem was introduced to us through 
a discussion at UC Berkeley J51- We have since learned that 
variations of the problem have also been studied in several 
different fields. 

In reliability engineering, the weighted-k-out-of-n system
[10] comprises n components, each having a positive integer
weight $w_i$ and surviving independently with probability $p_i$;
the system is in a good state if and only if the total weight
of its surviving components is at least a specified threshold k.
Related work on this system and its extensions has focused on
the efficient computation of the reliability of a given weight
allocation (see, e.g., [11]).

In peer-to-peer networking, the allocation problem deals 
with the recovery of a data object from peers that are available 
only probabilistically. Lin et al. [12] compared the performance
of uncoded replication vs. coded storage, restricted to
symmetric allocations, for the case where the budget is an
integer. 

In wireless communications, the allocation problem is stud- 
ied in the context of multipath routing, in which coded data 
is transmitted along different paths in an unreliable network, 
exploiting path diversity to improve the reliability of end-to- 
end communications. Tsirigos and Haas [14], [15] examined
the performance of symmetric allocations and noted the exis- 
tence of a phase transition in the optimal symmetric allocation; 
approximation methods were also proposed by the authors 
to tackle the optimization problem, especially for the case 
where path failures occur with nonuniform probabilities and 
may be correlated. Jain et al. [13] evaluated the performance
of symmetric allocations experimentally in a delay tolerant 
network setting, and presented an alternative formulation using 
Gaussian distributions to model partial access to nodes. 

Our work generalizes these previous efforts by considering 
nonsymmetric allocations and noninteger budgets. We also 
correct some inaccurate claims about the optimal symmetric 
allocation in [15] and its associated technical report.

B. Access to a Random Fixed-Size Subset of Nodes 

In the second problem formulation, we assume that the 
data collector accesses an r-subset of the n nodes selected 
uniformly at random from the collection of all $\binom{n}{r}$ possible
r-subsets, where r is a given constant. The resulting problem 
can be interpreted as that of maximizing the recovery prob- 
ability in a networked storage system of n nodes where the 
end user is able or allowed to contact up to r nodes randomly. 
We can treat this access model as an approximation to the 
preceding independent probabilistic access model by picking 
$r \approx np$. Finding the optimal allocation in this case is still
challenging. As in the first problem formulation, it is not 
hard to show that determining the recovery probability of a 
given allocation is computationally difficult (specifically, #P- 
complete). 

The problem appears nontrivial even if we restrict the 
optimization to only symmetric allocations. Numerically, we 
observe that given n and r, either a minimal or a maximal 
spreading of the budget is optimal among symmetric alloca- 
tions for most, if not all, choices of T. One example of an 
exception is (n,r,T) = (14, 5, |) for which it is optimal to 
have 8 nonempty nodes in the symmetric allocation, instead of 
the extremes 2 or 13; another example is (n, r, T) = (16, 4, |) 
for which it is optimal to have 7 nonempty nodes in the 
symmetric allocation, instead of the extremes 3 or 14. Further- 
more, the number of nonempty nodes in the optimal symmetric 
allocation is not necessarily a nondecreasing function of the 
budget T; for instance, given (n,r) = (20,4), it is optimal 
to have (4, 18, 14, 19, 20) nonempty nodes in the symmetric 
allocation for T = (4.25,4.5,4.67,4.75,5), respectively. 
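
These numerical observations about symmetric allocations are straightforward to reproduce. The sketch below is an illustration, not the authors' code; the function names are hypothetical. It assumes a symmetric allocation in which m nonempty nodes each store T/m, so that recovery succeeds exactly when at least $\lceil m/T \rceil$ of the r accessed nodes are nonempty, and the number of nonempty nodes among the r accessed follows a hypergeometric distribution.

```python
from fractions import Fraction
from math import ceil, comb


def symmetric_recovery_fixed_subset(n, r, T, m):
    """Recovery probability when m of n nodes each store T/m and an r-subset
    of the nodes is accessed uniformly at random."""
    need = ceil(Fraction(m) / Fraction(T))  # nonempty nodes that must be accessed
    # Count r-subsets containing at least `need` nonempty nodes (hypergeometric tail).
    good = sum(comb(m, j) * comb(n - m, r - j) for j in range(need, min(m, r) + 1))
    return Fraction(good, comb(n, r))


def best_symmetric_m(n, r, T):
    """Number of nonempty nodes that maximizes the recovery probability
    (the smallest such m is returned if there are ties)."""
    return max(range(1, n + 1),
               key=lambda m: symmetric_recovery_fixed_subset(n, r, T, m))


# Example with (n, r) = (20, 4); budgets are given as exact decimals.
for budget in ["4.25", "4.5", "4.67", "4.75", "5"]:
    print(budget, best_symmetric_m(20, 4, Fraction(budget)))
```

Budgets are passed as `Fraction` values so that boundary cases such as m/T being an exact integer are not disturbed by floating-point rounding.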

Our Contribution: We show that the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$
is optimal in the high recovery probability regime. Specifically,
we demonstrate that this allocation, which has a recovery
probability of exactly 1, minimizes the budget T necessary
for achieving any recovery probability exceeding a specified
threshold $1 - \epsilon$. Although $\epsilon$ depends on n and r in a
complicated way, we can conclude that for any r, this allocation is
optimal if the recovery probability is to exceed 1 — ^. 

We also make the following conjecture about the optimal 
allocation, based on our numerical observations: 

Conjecture. A symmetric optimal allocation always exists for 
any n, r, and T. 

Related Work: Sardari et al. [16] presented a method
of approximating an optimal solution to this problem by
considering a data collector that accesses r random nodes
with replacement. More recently, Alon et al. [17] showed that
this problem is related to an old conjecture by Erdős on the
maximum number of edges in a uniform hypergraph [18].

C. Probabilistic Symmetric Allocations 

In the third problem formulation, we assume a probabilistic 
allocation model in which the source selects a random allo- 
cation from a distribution of allocations, with the constraint 
that the expected total amount of storage used in an allocation 
is at most the given budget T. We specifically consider the 
case where each of the n nodes is selected by the source
independently with constant probability $\min\left(\frac{\ell T}{n}, 1\right)$ to store
a constant $\frac{1}{\ell}$ amount of data, thus creating a probabilistic
symmetric allocation of the budget. The data collector
subsequently accesses an r-subset of the n nodes selected
uniformly at random from the collection of all $\binom{n}{r}$ possible
r-subsets, where r is a given constant. The goal is to find an
optimal allocation, specified by the value of parameter $\ell$, that
maximizes the recovery probability. This model was conceived 
as a simplification of the preceding fixed-size subset access 
model which assumes a deterministic allocation of the budget. 

Our Contribution: We show that the choice of $\ell = r$, which
corresponds to a maximal spreading of the budget, is optimal 
when the given budget T is sufficiently large, or equivalently, 
when a sufficiently high recovery probability (specifically, | or 
higher) is achievable. We believe this is a reasonable operating 
regime for applications that require good reliability. 

D. Other Related Work 

Apart from the work done on the preceding problems, a va- 
riety of storage allocation problems have also been studied in a 
nonprobabilistic setting. For instance, the objective adopted in 
[19] and [20] is to minimize the total storage budget required
to satisfy a given set of deterministic recovery requirements 
in a network. Incidentally, the use of network coding makes 
it easier to deal with the total cost of content delivery, which 
covers the initial dissemination, storage, and eventual fetching 
of data objects; this cost-minimization problem is considered 
in [6] and [21], subject to various deterministic constraints
involving, for example, load balancing or fetching distance. 

We note that in most of the literature involving reliable 
distributed storage, either the data object is assumed to be 
replicated in its entirety (see, e.g., [22]), or, if coding is used,
every node is assumed to store the same amount of coded 
data (see, e.g., [23]-[27]). Allocations of a storage budget
with nodes possibly storing different amounts of data are not 
usually considered. 



TABLE I
Notation

Symbol            Definition
n                 total number of storage nodes, $n \ge 2$
$x_i$             amount of data stored in storage node i, $x_i \ge 0$, where $i \in \{1, \ldots, n\}$
T                 total storage budget, $1 \le T \le n$
$\mathbf{r}$      subset of nodes accessed, $\mathbf{r} \subseteq \{1, \ldots, n\}$
p                 access probability (Section II), $0 < p < 1$
r                 number of nodes accessed (Section III), $1 \le r \le n$
$1/\ell$          amount of data stored in each nonempty node (Section IV), $\ell > 0$
B(n, p)           binomial random variable with n trials and success probability p
1[G]              indicator function; 1[G] = 1 if statement G is true, and 0 otherwise
$\mathbb{Z}_0^+$  the set of nonnegative integers, i.e., $\mathbb{Z}^+ \cup \{0\}$

TABLE II
Optimal Allocations for Number of Nodes n = 2, 3, 4




In the following three sections, we define each problem 
formally and state our main results. Proofs of theorems are 
deferred to the appendix. Table I summarizes the notation used
throughout this paper. 

II. Independent Probabilistic Access to Each Node

In the first variation of the storage allocation problem, we 
consider a data collector that accesses each of the n nodes 
independently with probability p; successful recovery occurs 
if and only if the total amount of data stored in the accessed 
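nodes is at least 1. We seek an optimal allocation $(x_1, \ldots, x_n)$
of the budget T that maximizes the probability of successful
recovery, for a given choice of n, p, and T. This optimization
problem can be expressed as follows:

$\Pi_1(n, p, T)$:

$$\begin{aligned}
\underset{x_1, \ldots, x_n}{\text{maximize}} \quad & \sum_{\mathbf{r} \subseteq \{1, \ldots, n\}} p^{|\mathbf{r}|} (1-p)^{n-|\mathbf{r}|} \cdot \mathbf{1}\left[\sum_{i \in \mathbf{r}} x_i \ge 1\right] && (2) \\
\text{subject to} \quad & \sum_{i=1}^{n} x_i \le T && (3) \\
& x_i \ge 0 \quad \forall\, i \in \{1, \ldots, n\}. && (4)
\end{aligned}$$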


The objective function (2) is just the recovery probability,
expressed as the sum of the probabilities corresponding to
the subsets $\mathbf{r}$ that allow successful recovery. An equivalent
expression for (2) is

$$P\left[\sum_{i=1}^{n} x_i Y_i \ge 1\right],$$

where the $Y_i$'s are independent Bernoulli(p) random variables.
Inequality (3) expresses the budget constraint, and inequality (4)
ensures that a nonnegative amount of data is stored in each
node. For the trivial budget T = 1, the allocation (1, 0, ..., 0)
is optimal; for T = n, the allocation (1, ..., 1) is optimal.



Incidentally, computing the recovery probability of a given 
allocation turns out to be #P-hard: 

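Proposition 1. Computing the recovery probability

$$\sum_{\mathbf{r} \subseteq \{1, \ldots, n\}} p^{|\mathbf{r}|} (1-p)^{n-|\mathbf{r}|} \cdot \mathbf{1}\left[\sum_{i \in \mathbf{r}} x_i \ge 1\right]$$

for a given allocation $(x_1, \ldots, x_n)$ and choice of p is #P-hard.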



Table II lists the optimal allocations for n = 2, 3, 4, covering
all parameter values of $p \in (0, 1)$ and $T \in [1, n)$. These
solutions are obtained by minimizing T for each possible value
of the objective function (2). We observe that

(i) for any T, the symmetric allocation (1, . . . , 1, 0, . . . , 0), 
which corresponds to a minimal spreading of the budget 
(uncoded replication), appears to be optimal when p is 
sufficiently small, and 

(ii) the optimal symmetric allocation appears to perform 
well despite being suboptimal in some cases, e.g., when 
(n,T)= (4,§) andp> \. 

We will proceed to show that observation (i) is indeed true in
Section II-B; the opposite approach of spreading the budget
maximally over all nodes turns out to be asymptotically
optimal when p is sufficiently large, as will be demonstrated
in Section II-A. Motivated by observation (ii), we examine
the optimization problem restricted to symmetric allocations
in Section II-C.

For brevity, let x(n, T, m) denote the symmetric allocation
for n nodes that uses a total storage of T and contains exactly
$m \in \{1, 2, \ldots, n\}$ nonempty nodes:

$$x(n, T, m) \triangleq \Big(\underbrace{\tfrac{T}{m}, \ldots, \tfrac{T}{m}}_{m \text{ entries}}, \underbrace{0, \ldots, 0}_{(n-m) \text{ entries}}\Big).$$

Since successful recovery for the symmetric allocation
x(n, T, m) occurs if and only if at least $\lceil 1 / (T/m) \rceil = \lceil m/T \rceil$
out of the m nonempty nodes are accessed, the corresponding
probability of successful recovery can be written as

$$P_S(p, T, m) \triangleq P\left[B(m, p) \ge \left\lceil \tfrac{m}{T} \right\rceil\right].$$


A. Asymptotic Optimality of Maximal Spreading 

The recovery probability of the symmetric allocation
x(n, T, m=n), which corresponds to a maximal spreading of
the budget over all nodes, is given by

$$P_S(p, T, m{=}n) = P\left[B(n, p) \ge \left\lceil \tfrac{n}{T} \right\rceil\right]. \qquad (5)$$



To establish the optimality of this allocation, we compare (5)
to an upper bound for the recovery probability of an optimal
allocation. Such a bound can be derived by conditioning on
the number of accessed nodes:

Lemma 1. The probability of successful recovery for an
optimal allocation is at most

$$\sum_{r=0}^{n} \min\left(\frac{rT}{n}, 1\right) P[B(n, p) = r]. \qquad (6)$$

The suboptimality of x(n, T, m=n) is therefore bounded by
the difference between (6) and (5), as given by the following
theorem; when $p > \frac{1}{T}$, this allocation becomes asymptotically
optimal since its suboptimality gap vanishes as n goes to
infinity:

Theorem 1. The gap between the probabilities of successful
recovery for an optimal allocation and for the symmetric
allocation x(n, T, m=n), which corresponds to a maximal
spreading of the budget over all nodes, is at most

$$pT \cdot P\left[B(n-1, p) \le \left\lceil \tfrac{n}{T} \right\rceil - 2\right].$$

If p and T are fixed such that $p > \frac{1}{T}$, then this gap approaches
zero as n goes to infinity.
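
The relationship between the upper bound, the maximal-spreading probability, and the gap can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it reads the bound of Lemma 1 as $\sum_{r=0}^{n} \min(rT/n, 1)\, P[B(n,p) = r]$ and the gap of Theorem 1 as $pT \cdot P[B(n-1,p) \le \lceil n/T \rceil - 2]$, and all function names are hypothetical. Under that reading, the gap equals the bound minus the maximal-spreading probability exactly, which the final line verifies for one instance.

```python
from fractions import Fraction
from math import ceil, comb


def binomial_pmf(n, p, k):
    """P[B(n, p) = k] for a binomial random variable B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def max_spreading_probability(n, p, T):
    """Recovery probability when every node stores T/n: P[B(n, p) >= ceil(n/T)]."""
    need = ceil(Fraction(n) / Fraction(T))
    return sum(binomial_pmf(n, p, k) for k in range(need, n + 1))


def lemma1_upper_bound(n, p, T):
    """Assumed form of the Lemma 1 bound: sum over r of min(rT/n, 1) * P[B(n, p) = r]."""
    T = Fraction(T)
    return sum(min(r * T / n, 1) * binomial_pmf(n, p, r) for r in range(n + 1))


def theorem1_gap(n, p, T):
    """Assumed form of the Theorem 1 gap: pT * P[B(n-1, p) <= ceil(n/T) - 2]."""
    need = ceil(Fraction(n) / Fraction(T))
    return p * T * sum(binomial_pmf(n - 1, p, k) for k in range(need - 1))


# Example with pT = 3/2 > 1; exact rationals avoid rounding in the comparison.
n, p, T = 20, Fraction(3, 5), Fraction(5, 2)
bound = lemma1_upper_bound(n, p, T)
achieved = max_spreading_probability(n, p, T)
print(float(bound), float(achieved), float(theorem1_gap(n, p, T)))
print(bound - achieved == theorem1_gap(n, p, T))
```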

We note that the regime $p > \frac{1}{T}$ is particularly interesting
because it corresponds to the regime of high recovery probability;
the recovery probability would be bounded away from 1 if
$p < \frac{1}{T} \iff pT < 1$ instead. This follows from the application
of Markov's inequality to the random variable W denoting
the total amount of data accessed by the data collector, which
produces

$$P[W \ge 1] \le E[W].$$

Since $P[W \ge 1]$ is just the probability of successful recovery,
and $E[W] \le pT$ according to (1), we have

$$P[\text{successful recovery}] \le pT.$$




Fig. 2. Plot of access probability p against budget T, showing regions
of (T, p) over which the sufficient conditions of the theorems are satisfied,
for n = 20. Minimal spreading (uncoded replication) is optimal among all
allocations in the colored regions. Legend: Theorem 2, Corollary 1.



B. Optimality of Minimal Spreading (Uncoded Replication) 

The recovery probability of the symmetric allocation
$x(n, T, m{=}\lfloor T \rfloor)$, which corresponds to a minimal spreading
of the budget, is given by

$$P_S(p, T, m{=}\lfloor T \rfloor) = P[B(\lfloor T \rfloor, p) \ge 1] = 1 - (1-p)^{\lfloor T \rfloor}. \qquad (7)$$

Recall that coding is unnecessary in such an allocation since 
the data object is stored in its entirety in each nonempty node. 
A sufficient condition for the optimality of this allocation
can be found by comparing (7) to an upper bound for the
recovery probabilities of all other allocations. Our approach
is to classify each allocation according to the number of
individual nodes that store at least a unit amount of data. We
then find a bound for allocations containing exactly 0 such
nodes, another bound for allocations containing exactly 1 such
node, and so on. The subsequent comparisons of (7) to each
of these bounds result in the following theorem:

Theorem 2. If 1 < T < n and



l-(l-p) 



LTj-r 



+ {n-i) 



1 1 



r=2 



1-P 

n-£ 
r 



1-p 



>0 (8) 



for all $\ell \in \{0, 1, \ldots, \lfloor T \rfloor - 1\}$, then $x(n, T, m{=}\lfloor T \rfloor)$, which
corresponds to a minimal spreading of the budget (uncoded
replication), is an optimal allocation.

The following corollary shows that this allocation is indeed 
optimal for sufficiently small p: 



Corollary 1. If 1 < T < n and p < 

x (n, T, m~\T\ ) is an optimal allocation. 



(n-[T\y 



then 



Fig. 2 illustrates these results in the form of a region plot
for an instance of n.

Fig. 3. Plot of recovery probability $P_S$ against budget T for each symmetric
allocation x(n,T,m), for (n,p) = (20, Parameter m denotes the number
of nonempty nodes in the symmetric allocation. The black curve gives an
upper bound for the recovery probability of an optimal allocation, as derived
in Lemma 1.



C. Optimal Symmetric Allocation 

The optimization problem appears nontrivial even if we 
were to consider only symmetric allocations. Fig. 3, which
compares the performance of different symmetric allocations
over different budgets for an instance of (n, p), demonstrates
that the value of m corresponding to the optimal symmetric
allocation can change drastically as the budget T varies.

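Fortunately, we can eliminate many candidates for the
optimal value of m by making the following observation:
Recall that the recovery probability of the symmetric allocation
x(n, T, m) is given by $P_S(p, T, m) \triangleq P\left[B(m, p) \ge \lceil \tfrac{m}{T} \rceil\right]$.
For fixed n, p, and T, we have

$$\left\lceil \tfrac{m}{T} \right\rceil = k \quad \text{when } m \in \big((k-1)T,\, kT\big],$$

for $k = 1, 2, \ldots, \lfloor \tfrac{n}{T} \rfloor$, and finally,

$$\left\lceil \tfrac{m}{T} \right\rceil = \left\lfloor \tfrac{n}{T} \right\rfloor + 1 \quad \text{when } m \in \big(\lfloor \tfrac{n}{T} \rfloor T,\, n\big].$$

Since $P[B(m, p) \ge k]$ is nondecreasing in m for constant p
and k, it follows that $P_S(p, T, m)$ is maximized within each of
these intervals of m when we pick m to be the largest integer
in the corresponding interval. Thus, given n, p, and T, we can
find an optimal m* that maximizes $P_S(p, T, m)$ over all m
from among $\lceil \tfrac{n}{T} \rceil$ candidates:

$$\left\{ \lfloor T \rfloor, \lfloor 2T \rfloor, \ldots, \left\lfloor \lfloor \tfrac{n}{T} \rfloor T \right\rfloor, n \right\}. \qquad (9)$$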
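
The reduction to a short candidate list can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation; the function names are hypothetical, and the candidate set is taken to be $\{\lfloor T \rfloor, \lfloor 2T \rfloor, \ldots, \lfloor \lfloor n/T \rfloor T \rfloor, n\}$ with $1 \le T \le n$ assumed. The example compares the best candidate against an exhaustive search over all m for one small instance.

```python
from fractions import Fraction
from math import ceil, comb, floor


def symmetric_recovery(p, T, m):
    """P_S(p, T, m) = P[B(m, p) >= ceil(m/T)] under independent access."""
    need = ceil(Fraction(m) / Fraction(T))
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(need, m + 1))


def candidate_m_values(n, T):
    """Candidates floor(T), floor(2T), ..., floor(floor(n/T) * T), and n.
    Assumes 1 <= T <= n."""
    T = Fraction(T)
    candidates = [floor(k * T) for k in range(1, floor(Fraction(n) / T) + 1)]
    if candidates[-1] != n:  # the last two candidates coincide when n/T is an integer
        candidates.append(n)
    return candidates


# Example instance (assumed values): n = 5, p = 2/3, T = 7/3.
n, p, T = 5, Fraction(2, 3), Fraction(7, 3)
candidates = candidate_m_values(n, T)
best_candidate = max(candidates, key=lambda m: symmetric_recovery(p, T, m))
best_overall = max(range(1, n + 1), key=lambda m: symmetric_recovery(p, T, m))
print(candidates, best_candidate, best_overall)
```

Only about n/T evaluations of the binomial tail are needed instead of n, which matters when the search is repeated over many values of p and T.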



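For $m = \lfloor kT \rfloor$, where $k \in \mathbb{Z}^+$, the corresponding probability
of successful recovery is given by

$$P_S(p, T, m{=}\lfloor kT \rfloor) = P[B(\lfloor kT \rfloor, p) \ge k].$$

The difference between the probabilities of successful recovery
for consecutive values of $k \in \mathbb{Z}^+$ can be written as

$$\begin{aligned}
\Delta(p, T, k) &\triangleq P_S(p, T, m{=}\lfloor (k+1)T \rfloor) - P_S(p, T, m{=}\lfloor kT \rfloor) \\
&= P[B(\lfloor (k+1)T \rfloor, p) \ge k+1] - P[B(\lfloor kT \rfloor, p) \ge k] \\
&= \sum_{i=1}^{\min(\alpha_{k,T}-1,\, k)} P[B(\lfloor kT \rfloor, p) = k-i] \cdot P[B(\alpha_{k,T}, p) \ge i+1] \\
&\qquad - P[B(\lfloor kT \rfloor, p) = k] \cdot P[B(\alpha_{k,T}, p) = 0],
\end{aligned}$$

where $\alpha_{k,T} \triangleq \lfloor (k+1)T \rfloor - \lfloor kT \rfloor$. The above expression is
obtained by comparing the branches of the probability tree
for $\lfloor kT \rfloor$ vs. $\lfloor (k+1)T \rfloor$ independent Bernoulli(p) trials: the
first term describes unsuccessful events ("$B(\lfloor kT \rfloor, p) < k$")
becoming successful ("$B(\lfloor (k+1)T \rfloor, p) \ge k+1$") after the
additional $\alpha_{k,T}$ trials, while the second term describes successful
events ("$B(\lfloor kT \rfloor, p) \ge k$") becoming unsuccessful
("$B(\lfloor (k+1)T \rfloor, p) < k+1$") after the additional $\alpha_{k,T}$ trials.
After further simplification, we arrive at

$$\Delta(p, T, k) = p^k (1-p)^{\lfloor (k+1)T \rfloor - k} \left\{ \sum_{i=1}^{\min(\alpha_{k,T}-1,\, k)} \sum_{j=i+1}^{\alpha_{k,T}} \binom{\lfloor kT \rfloor}{k-i} \binom{\alpha_{k,T}}{j} \left(\frac{p}{1-p}\right)^{j-i} - \binom{\lfloor kT \rfloor}{k} \right\}. \qquad (10)$$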
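
As a sanity check on the simplification, the difference of binomial tails can be compared against the factored form. The sketch below is an illustration, not the authors' code; the function names are hypothetical. It assumes that the simplified expression is $p^k (1-p)^{\lfloor (k+1)T \rfloor - k}$ times a double sum of binomial coefficients weighted by powers of $p/(1-p)$, minus $\binom{\lfloor kT \rfloor}{k}$, with $\alpha_{k,T} = \lfloor (k+1)T \rfloor - \lfloor kT \rfloor$ and $0 < p < 1$.

```python
from fractions import Fraction
from math import comb, floor


def binomial_tail(m, p, k):
    """P[B(m, p) >= k]."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))


def delta_direct(p, T, k):
    """Difference of recovery probabilities for m = floor((k+1)T) and m = floor(kT)."""
    return binomial_tail(floor((k + 1) * T), p, k + 1) - binomial_tail(floor(k * T), p, k)


def delta_factored(p, T, k):
    """Assumed factored form of the same difference; requires 0 < p < 1."""
    m = floor(k * T)
    alpha = floor((k + 1) * T) - m  # number of additional Bernoulli trials
    ratio = p / (1 - p)
    inner = sum(comb(m, k - i) * comb(alpha, j) * ratio ** (j - i)
                for i in range(1, min(alpha - 1, k) + 1)
                for j in range(i + 1, alpha + 1))
    return p**k * (1 - p) ** (m + alpha - k) * (inner - comb(m, k))


# Example: p = 1/2, T = 5/2, k = 1 (exact rationals, so the comparison is exact).
p, T, k = Fraction(1, 2), Fraction(5, 2), 1
print(delta_direct(p, T, k), delta_factored(p, T, k))
```

A positive value means that moving from $\lfloor kT \rfloor$ to $\lfloor (k+1)T \rfloor$ nonempty nodes improves the recovery probability for that choice of p and T.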



The following theorem essentially provides a sufficient
condition on p and T for $\Delta(p, T, k) \ge 0$ for any $k \in \mathbb{Z}^+$,
thereby eliminating all but the two largest candidate values for
m* in (9), i.e., $m = \lfloor \lfloor \tfrac{n}{T} \rfloor T \rfloor$ and m = n, which correspond
to a maximal spreading of the budget over (almost) all nodes
(they are identical when $\tfrac{n}{T} \in \mathbb{Z}^+$, i.e., $T = n, \tfrac{n}{2}, \tfrac{n}{3}, \ldots$):

Theorem 3. If

$$(1-p)^{\lfloor T \rfloor} + 2 \lfloor T \rfloor p (1-p)^{\lfloor T \rfloor - 1} - 1 \le 0, \qquad (11)$$

then either $x(n, T, m{=}\lfloor \lfloor \tfrac{n}{T} \rfloor T \rfloor)$ or $x(n, T, m{=}n)$, which
correspond to a maximal spreading of the budget, is an optimal
symmetric allocation.

The following corollary restates Theorem 3 in a slightly
weaker but more convenient form:

Corollary 2. If $p \ge \frac{4}{3 \lfloor T \rfloor}$, then either $x(n, T, m{=}\lfloor \lfloor \tfrac{n}{T} \rfloor T \rfloor)$
or $x(n, T, m{=}n)$ is an optimal symmetric allocation.

The following lemma mirrors Theorem 3 by providing a
sufficient condition on p and T for $\Delta(p, T, k) \le 0$ for any
$k \in \mathbb{Z}^+$, thereby eliminating all but the smallest candidate
value for m* in (9), i.e., $m = \lfloor T \rfloor$, which corresponds to a
minimal spreading of the budget (uncoded replication):



Lemma 2. IfT>\, and either 



1 



P < — a«of p (1 
P 



P = - G IS 
V 



i) 



fri-i 



(12) 



(13) 



then $x(n, T, m{=}\lfloor T \rfloor)$ is an optimal symmetric allocation.

The following lemma restates Lemma 2 in a slightly weaker
but more convenient form:

Lemma 3. If p < — ^, then x (n, T, m~[T\ ) is an 
optimal symmetric allocation. 

Fig. 4. Plot of access probability p against budget T, showing regions of
(T, p) over which the sufficient conditions of the theorems are satisfied. The
black dashed curve marks the points satisfying $p = \frac{1}{T}$. Maximal spreading is
optimal among symmetric allocations in the colored regions above the curve,
while minimal spreading (uncoded replication) is optimal among symmetric
allocations in the colored regions below the curve. Legend: Theorem 3,
Corollary 2, Theorem 4, Lemma 3.



The following theorem expands the region covered by
Lemma 3 by showing that $x(n, T, m{=}\lfloor T \rfloor)$ remains optimal
between the "peaks" in Fig. 4:



Theorem 4. If p < j^-r, then x (n, T, m=\T\ ), which cor- 
responds to a minimal spreading of the budget (uncoded 
replication), is an optimal symmetric allocation. 

Fig. 4 illustrates these results in the form of a region plot.
The theorems cover all choices of p and T except for the
gap around $p = \frac{1}{T}$, which diminishes with increasing T. Both
minimal and maximal spreading of the budget may be suboptimal
among symmetric allocations in this gap on either side of
the curve p = ^: for example, when (n,p,T) — (10, |), 
for which p < ik, the optimal symmetric allocation is 



10 5 12^ 

u ' 5' 5 



for which 



x(n,T,m=L2Tj); when (n,p,T) = 
p > ^7, the optimal symmetric allocation is x (n, T, m=|_3T] 
In general, for any budget T > 2, the optimal symmetric
allocation changes from minimal spreading to maximal spreading
eventually, as the access probability p increases. This
transition, which is not necessarily sharp, appears to occur at
around $p = \frac{1}{T}$. Interestingly, when $p = \frac{1}{T}$ exactly, we observe
numerically that $x(n, T, m{=}\lfloor T \rfloor)$ is the optimal symmetric
allocation for most values of T; the optimal symmetric
allocation changes continually over the intervals

1.5 < T < 2 and 2.5 < T < 2.8911,

while $x(n, T, m{=}\lfloor 2T \rfloor)$ is optimal for 3.5 < T < 3.5694.
These findings suggest that it may be difficult to specify an
optimal symmetric allocation for values of p and T in the gap;
we can, however, restrict our search for an optimal symmetric
allocation to the $\lceil \frac{n}{T} \rceil$ candidates given by (9).



III. Access to a Random Fixed-Size Subset of Nodes 

In the second variation of the storage allocation problem,
we consider a data collector that accesses an r-subset of the n
nodes selected uniformly at random from the collection of all
$\binom{n}{r}$ possible r-subsets, where r is a given constant; successful
recovery occurs if and only if the total amount of data stored in
the accessed nodes is at least 1. We seek an optimal allocation
$(x_1, \ldots, x_n)$ of the budget T that maximizes the probability
of successful recovery, for a given choice of n, r, and T. This
optimization problem can be expressed as follows:

$\Pi_2(n, r, T)$:

$$\begin{aligned}
\underset{x_1, \ldots, x_n,\, P_S}{\text{maximize}} \quad & P_S && (14) \\
\text{subject to} \quad & \sum_{\substack{\mathbf{r} \subseteq \{1, \ldots, n\}: \\ |\mathbf{r}| = r}} \frac{1}{\binom{n}{r}} \cdot \mathbf{1}\left[\sum_{i \in \mathbf{r}} x_i \ge 1\right] \ge P_S && (15) \\
& \sum_{i=1}^{n} x_i \le T && (16) \\
& x_i \ge 0 \quad \forall\, i \in \{1, \ldots, n\}. && (17)
\end{aligned}$$


The left-hand side of inequality (15) is just the recovery
probability, expressed as the sum of the probabilities corresponding
to the r-subsets $\mathbf{r}$ that allow successful recovery.
The objective function (14) is therefore equal to the recovery
probability since $P_S$ is maximized when (15) holds with
equality. Inequality (16) expresses the budget constraint, and
inequality (17) ensures that a nonnegative amount of data is
stored in each node. For the trivial budget T = 1, the allocation
(1, 0, ..., 0) is optimal; for $T \ge \frac{n}{r}$, the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$,
which has the maximal recovery probability of 1, is optimal.
Incidentally, computing the recovery probability of a given
allocation turns out to be #P-complete:

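Proposition 2. Computing the recovery probability

\[
\frac{1}{\binom{n}{r}} \sum_{\substack{\mathbf{r} \subseteq \{1,\ldots,n\}:\\ |\mathbf{r}| = r}} \mathbb{1}\!\left[\sum_{i \in \mathbf{r}} x_i \ge 1\right]
\]

for a given allocation $(x_1, \ldots, x_n)$ and choice of $r$ is #P-complete.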
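As an illustration of the quantity in Proposition 2, the following minimal sketch evaluates the recovery probability by brute-force enumeration of all $r$-subsets. The function name `recovery_probability` is ours, not part of the original text, and the exponential-time enumeration is only practical for small $n$, which is consistent with the hardness result.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def recovery_probability(allocation, r):
    """Fraction of r-subsets of nodes whose stored data sums to at least 1.

    Hypothetical helper for illustration: enumerates all C(n, r) subsets.
    """
    n = len(allocation)
    good = sum(
        1
        for subset in combinations(range(n), r)
        if sum(allocation[i] for i in subset) >= 1
    )
    return Fraction(good, comb(n, r))


# Example with (n, r) = (6, 2), using exact rational arithmetic.
half = Fraction(1, 2)
print(recovery_probability([1, 0, 0, 0, 0, 0], 2))  # budget T = 1
print(recovery_probability([1, 1, 0, 0, 0, 0], 2))  # budget T = 2
print(recovery_probability([half] * 6, 2))          # budget T = n/r = 3
```

Exact fractions are used so that comparisons against the unit file size are not affected by floating-point rounding.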

An alternate way of formulating this problem is to minimize the budget $T$ required to achieve a desired recovery probability $P_S$:

\[
\Pi'_2(n,r,P_S):\quad \text{minimize } T \quad \text{subject to the three constraints (15)-(17) of } \Pi_2(n,r,T).
\]

Fig. 5 shows how the optimal recovery probability $\max P_S$ varies with the budget $T$, for two instances of $(n,r)$. These plots are obtained by solving $\Pi'_2(n,r,P_S)$ for each possible value of $P_S$. We observe that when the budget $T$ drops below $\frac{n}{r}$, the optimal recovery probability $\max P_S$ is reduced by a significant margin below 1. In other words, if the desired recovery probability $P_S$ in $\Pi'_2(n,r,P_S)$ is sufficiently high, then the optimal allocation is $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$, which requires a budget of $T = \frac{n}{r}$. In Section III-A, we examine the optimality of this allocation for the high recovery probability regime.
Fig. 5. Plot of the optimal recovery probability $\max P_S$ against budget $T$, for (a) $(n,r) = (6,2)$ and (b) $(n,r) = (5,3)$. The optimal allocation corresponding to each value of $\max P_S$ is given on the right-hand side of the plot. In (a), the red dashed line marks the threshold on $P_S$ derived in Theorem 5; the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is optimal for $\Pi'_2(n,r,P_S)$ if and only if the desired recovery probability $P_S$ exceeds this threshold. In (b), the red dashed line marks the threshold on $P_S$ derived in Theorem 6; the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is optimal for $\Pi'_2(n,r,P_S)$ if $P_S$ exceeds this threshold.



A. Regime of High Recovery Probability 

Consider the optimization problem $\Pi'_2(n,r,P_S)$. We will demonstrate that the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is optimal when the desired recovery probability $P_S$ exceeds a specified threshold expressed in terms of $n$ and $r$. Our results follow from the observation that successful recovery for certain combinations of $r$-subsets of nodes can impose a lower bound on the required budget $T$. For example, given $(n,r) = (4,2)$, if successful recovery is to occur for $\{1,2\}$ and $\{3,4\}$, possibly among other $r$-subsets of nodes, then we have

\[
\sum_{i \in \{1,2\}} x_i \ge 1 \quad \text{and} \quad \sum_{i \in \{3,4\}} x_i \ge 1,
\]

which would imply that the minimum budget $T$ must be at least 2, since

\[
T \ge \sum_{i=1}^{4} x_i = \sum_{i \in \{1,2\}} x_i + \sum_{i \in \{3,4\}} x_i \ge 2.
\]

This observation is generalized by the following lemma:

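Lemma 4. Consider a set $S \subseteq \{1, \ldots, n\}$, and $c$ subsets of $S$ given by $\mathbf{r}_j \subseteq S$, $j = 1, \ldots, c$. If

\[
\sum_{i \in \mathbf{r}_j} x_i \ge 1 \quad \forall\, j \in \{1, \ldots, c\}, \tag{18}
\]

and each element in $S$ appears exactly $b > 0$ times among the $c$ subsets, i.e.,

\[
\sum_{j=1}^{c} \mathbb{1}\left[i \in \mathbf{r}_j\right] = b \quad \forall\, i \in S,
\]

then

\[
\sum_{i \in S} x_i \ge \frac{c}{b}. \tag{19}
\]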
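The bound of Lemma 4 can be illustrated with a small sketch. The helper name `lemma4_budget_bound` is hypothetical; it takes $S$ to be the union of the given subsets, checks that every element of $S$ is covered the same number of times $b$, and returns the resulting lower bound $c/b$ on $\sum_{i \in S} x_i$.

```python
from collections import Counter
from fractions import Fraction


def lemma4_budget_bound(subsets):
    """Return c/b when every element of the union S lies in exactly b subsets.

    Returns None when the covering is not uniform, in which case the
    lemma (as stated above) does not apply directly.
    """
    counts = Counter(i for subset in subsets for i in set(subset))
    multiplicities = set(counts.values())
    if len(multiplicities) != 1:
        return None
    b = multiplicities.pop()
    return Fraction(len(subsets), b)


# The (n, r) = (4, 2) example: {1, 2} and {3, 4} force a budget of at least 2.
print(lemma4_budget_bound([{1, 2}, {3, 4}]))
```

Using all six 2-subsets of $\{1,2,3,4\}$ gives $c = 6$ and $b = 3$, so the same bound of 2 is obtained, which equals $\frac{n}{r}$ for this instance.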



We begin with the special case of probability-1 recovery, i.e., $P_S = 1$. The resulting optimization problem is just a linear program with all $\binom{n}{r}$ possible $r$-subset constraints.

Lemma 5. If $P_S = 1$, then $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation.



When the desired recovery probability $P_S$ is less than 1, we can afford to drop some of the $r$-subset constraints from this linear program (recall that the recovery probability of an allocation is just the fraction of these $\binom{n}{r}$ constraints that are satisfied). Our task is to determine how many such constraints can be dropped before the lower bound for $T$ obtained with the help of Lemma 4 falls below $\frac{n}{r}$, in which case the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ may no longer be optimal. We do this by constructing collections of $r$-subset constraints that yield the required lower bound of $\frac{n}{r}$ for $T$, and counting how many $r$-subset constraints need to be removed from the linear program before no such collection remains. Our answer depends on the divisibility of $n$ by $r$.

When $n$ is a multiple of $r$, we are able to state a necessary and sufficient condition on $P_S$ for the allocation to be optimal:

Theorem 5. If $n$ is a multiple of $r$, then $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation if and only if

\[
P_S > 1 - \frac{r}{n}.
\]

When $n$ is not a multiple of $r$, we are only able to state a sufficient condition on $P_S$ for the allocation to be optimal:

Theorem 6. If $n$ is not a multiple of $r$, then $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation if

\[
P_S > 1 - \frac{\gcd(r, r')}{\alpha \gcd(r, r') + r'},
\]

where $\alpha$ and $r'$ are uniquely defined integers satisfying

\[
n = \alpha r + r', \quad \alpha \in \mathbb{Z}_0^+, \quad r' \in \{r+1, \ldots, 2r-1\}.
\]

However, if $n$ is a multiple of $(n - r)$, then this sufficient condition becomes necessary too:

Corollary 3. If $n$ is a multiple of $(n - r)$, then $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation if and only if

\[
P_S > \frac{r}{n}.
\]
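The thresholds in Theorems 5 and 6 are easy to tabulate. The sketch below assumes the threshold expressions $1 - \frac{r}{n}$ (when $r$ divides $n$) and $1 - \frac{\gcd(r,r')}{\alpha \gcd(r,r') + r'}$ (otherwise, with $n = \alpha r + r'$ and $r' \in \{r+1,\ldots,2r-1\}$); the function name and the returned flag are our own conventions.

```python
from fractions import Fraction
from math import gcd


def high_probability_threshold(n, r):
    """Return (threshold, is_tight) for the optimality of (1/r, ..., 1/r).

    is_tight is True when the condition P_S > threshold is stated as
    necessary and sufficient (n a multiple of r), and False when it is
    only stated as sufficient.
    """
    if not 1 <= r <= n:
        raise ValueError("require 1 <= r <= n")
    if n % r == 0:
        return 1 - Fraction(r, n), True
    alpha = n // r - 1          # nonnegative because n > r here
    r_prime = n - alpha * r     # lies in {r + 1, ..., 2r - 1}
    g = gcd(r, r_prime)
    return 1 - Fraction(g, alpha * g + r_prime), False


print(high_probability_threshold(6, 2))  # instance of Fig. 5(a)
print(high_probability_threshold(5, 3))  # instance of Fig. 5(b)
```

For $(n,r) = (6,4)$, where $n$ is a multiple of $n - r$, the computed threshold is $\frac{2}{3} = \frac{r}{n}$, in agreement with the form of Corollary 3.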

Note that Corollary 3 allows us to solve $\Pi_2(n,r,T)$ completely when $n$ is a multiple of $(n - r)$: for any $T \in \left[1, \frac{n}{r}\right)$, the allocation $(1, 0, \ldots, 0)$ is optimal since it has a recovery probability of $\frac{r}{n}$, i.e., exactly the threshold in Corollary 3; higher recovery probabilities are not achievable unless $T \ge \frac{n}{r}$.

Fig. 6 illustrates these results for an instance of $n$.
Fig. 6. Plot of the desired recovery probability $P_S$ against the number of nodes accessed $r$, showing intervals of $P_S$ over which the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is optimal for $\Pi'_2(n,r,P_S)$, for $n = 40$. A dotted circle marker denotes an endpoint that may not be tight, i.e., we have not demonstrated that the allocation is suboptimal everywhere outside the interval.



By combining the proof techniques of Lemma 1 and Theorems 2, 5, and 6, we can derive the improved upper bound $P_S^{\mathrm{UB}}$, given by (20) at the bottom of the page, for the recovery probability of an optimal allocation in the independent probabilistic access model of Section II (cf. Lemma 1). Variables $\alpha$ and $r'$ are uniquely defined integers satisfying

\[
n - \ell = \alpha r + r', \quad \alpha \in \mathbb{Z}_0^+, \quad r' \in \{r, \ldots, 2r-1\}.
\]

Parameter $\ell$ denotes the number of individual nodes that store at least a unit amount of data. At least $\ell$ amount of data is stored in these complete nodes, leaving the remaining budget of at most $T - \ell$ to be allocated over the remaining $n - \ell$ incomplete nodes. Term (i) gives the probability of successful recovery from accessing at least one complete node, while term (ii) gives an upper bound on the probability of successful recovery from accessing exactly $r \in \{2, \ldots, n - \ell\}$ incomplete nodes.

IV. Probabilistic Symmetric Allocations 

In the third variation of the storage allocation problem, we consider the case where each of the $n$ nodes is selected by the source independently with probability $\min\left(\frac{\ell T}{n}, 1\right)$ to store $\frac{1}{\ell}$ amount of data, so that the expected total amount of storage used in the resulting symmetric allocation is at most $n \cdot \frac{\ell T}{n} \cdot \frac{1}{\ell} = T$, the given budget. The data collector subsequently accesses an $r$-subset of the $n$ nodes selected uniformly at random from the collection of all $\binom{n}{r}$ possible $r$-subsets, where $r$ is a given constant; successful recovery occurs if and only if the total amount of data stored in the accessed nodes is at least 1. We seek an optimal probabilistic symmetric allocation of the budget $T$, specified by the value of parameter $\ell$, that maximizes the probability of successful recovery, for a given choice of $n$, $r$, and $T$. Since successful recovery for a particular choice of $\ell$ occurs if and only if at least $\left\lceil 1 / \left(\frac{1}{\ell}\right) \right\rceil = \lceil \ell \rceil$ out of the $r$ accessed nodes are nonempty, the corresponding probability of successful recovery can be written as

\[
P_S(n,r,T,\ell) \triangleq \mathbb{P}\left[ \mathrm{B}\left(r, \min\left(\tfrac{\ell T}{n}, 1\right)\right) \ge \lceil \ell \rceil \right].
\]

This optimization problem can therefore be expressed as follows:

\[
\Pi_3(n,r,T):\quad \underset{\ell}{\text{maximize}} \;\; \mathbb{P}\left[ \mathrm{B}\left(r, \min\left(\tfrac{\ell T}{n}, 1\right)\right) \ge \lceil \ell \rceil \right] \quad \text{subject to } \ell > 0.
\]

For budget $T \ge \frac{n}{r}$, the choice of $\ell = r$, which yields a recovery probability of $\mathbb{P}[\mathrm{B}(r, 1) \ge r] = 1$, is optimal.

Observe that the recovery probability $P_S(n,r,T,\ell)$ is zero when $\ell > r$. Furthermore, for fixed $n$, $r$, and $T$, the recovery probability is nondecreasing in $\ell$ within each of the unit intervals $(0,1], (1,2], (2,3], \ldots$, since as $\ell$ increases within each interval, $\lceil \ell \rceil$ remains constant while $\min\left(\frac{\ell T}{n}, 1\right)$ either increases or remains constant at 1. Thus, given $n$, $r$, and $T$, we can find an optimal $\ell^*$ from among $r$ candidates:

\[
\{1, 2, \ldots, r\}. \tag{21}
\]
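The search over the $r$ candidates in (21) can be sketched directly. The code below is a minimal illustration, assuming the expression $\mathbb{P}[\mathrm{B}(r, \min(\ell T / n, 1)) \ge \lceil \ell \rceil]$ for the recovery probability; the function names are ours, and floating-point arithmetic is used for simplicity.

```python
from math import ceil, comb


def binomial_tail(trials, prob, k):
    """P[B(trials, prob) >= k] for a binomial random variable."""
    return sum(
        comb(trials, j) * prob**j * (1 - prob) ** (trials - j)
        for j in range(k, trials + 1)
    )


def recovery_probability(n, r, T, ell):
    """Recovery probability of the probabilistic symmetric allocation."""
    q = min(ell * T / n, 1.0)  # probability that a given node is nonempty
    return binomial_tail(r, q, ceil(ell))


def best_ell(n, r, T):
    """Best spreading parameter among the candidates {1, ..., r}."""
    return max(range(1, r + 1), key=lambda ell: recovery_probability(n, r, T, ell))


print(best_ell(100, 10, 1))  # small budget
print(best_ell(20, 4, 5))    # budget T = n/r
```

For the small budget the search returns $\ell = 1$ (minimal spreading), while at $T = \frac{n}{r}$ it returns $\ell = r$ (maximal spreading), matching the two regimes discussed in the text.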



Fig. 7, which compares the performance of different probabilistic symmetric allocations over different budgets for an instance of $r$, suggests that there are two distinct phases pertaining to the optimal choice of $\ell$: when the budget is below a certain threshold, the choice of $\ell = 1$, which corresponds to a minimal spreading of the budget (uncoded replication), is optimal; when the budget exceeds that same threshold, the choice of $\ell = r$, which corresponds to a maximal spreading of the budget, becomes optimal. This observation echoes our findings on the allocation and access models of the preceding sections, namely that minimal spreading ($\ell = 1$) is optimal for sufficiently small budgets, while maximal spreading ($\ell = r$) is optimal for sufficiently large budgets. However, we note two important distinctions in contrast to the previous models. First, the recovery probability for a probabilistic symmetric allocation in this model is a continuous nondecreasing function of the given budget; there are no "jumps" from one discrete value to the next. Second, our empirical computations suggest that


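\[
P_S^{\mathrm{UB}} \triangleq \max_{\ell \in \{0,1,\ldots,\lfloor T \rfloor\}} \Bigg\{ \underbrace{1 - (1-p)^{\ell}}_{\text{(i)}} + \underbrace{\mathbb{1}\big[\ell < \lfloor T \rfloor\big] \cdot (1-p)^{\ell} \sum_{r=2}^{n-\ell} \min\Bigg( \underbrace{\frac{r(T-\ell)}{n-\ell}}_{\text{cf. Lemma 1}},\; \underbrace{1 - \mathbb{1}\Big[T - \ell < \frac{n-\ell}{r}\Big] \cdot \frac{\gcd(r,r')}{\alpha \gcd(r,r') + r'}}_{\text{cf. Theorems 5 and 6}} \Bigg) \cdot \mathbb{P}\big[\mathrm{B}(n-\ell,p) = r\big]}_{\text{(ii)}} \Bigg\}. \tag{20}
\]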
Fig. 7. Plot of recovery probability $P_S$ against budget-per-node $\frac{T}{n}$ for each choice of parameter $\ell \in \{1, 2, \ldots, r\}$, for $r = 10$. Parameter $\ell$ controls how much the budget is spread in the probabilistic symmetric allocation; specifically, each of the $n$ nodes is selected by the source independently with probability $\min\left(\frac{\ell T}{n}, 1\right)$ to store $\frac{1}{\ell}$ amount of data. Arrows indicate the direction of increasing $\ell$. The black dashed line marks the threshold on $\frac{T}{n}$ derived in Theorem 7; maximal spreading ($\ell = r$) is optimal for any $\frac{T}{n}$ greater than or equal to this threshold.

Fig. 8. Plot of recovery probability $P_S$ against the number of nodes accessed $r$, indicating the value of $P_S$ at which the optimal choice of parameter $\ell$ changes from 1 to $r$, for each given value of $r$. Specifically, if it is possible to achieve a recovery probability $P_S$ above the square marker, then maximal spreading ($\ell = r$) is optimal; otherwise, minimal spreading or uncoded replication ($\ell = 1$) is optimal. Observe that the critical value of $P_S$ for $r = 10$ (which is approximately 0.633652) corresponds to the intersection point of the curves for $\ell = 1$ and $\ell = 10$ in Fig. 7.



the phase transition from the optimality of minimal spreading to that of maximal spreading in this model is sharp; the other intermediate values of $\ell \in \{2, \ldots, r-1\}$ never perform better than both $\ell = 1$ and $\ell = r$ simultaneously.

In Section IV-A, we shall demonstrate that the choice of $\ell = r$, which corresponds to a maximal spreading of the budget, is indeed optimal when the given budget $T$ is sufficiently large, or equivalently, when a sufficiently high recovery probability is achievable.

A. Optimality of Maximal Spreading 

Assume that $r \ge 2$. As noted earlier, the choice of $\ell = r$, which corresponds to a maximal spreading of the budget, is optimal for any $T \ge \frac{n}{r}$ because it yields the maximal recovery probability of 1. The following lemma provides an upper bound for the recovery probabilities corresponding to the other candidate values for $\ell^*$ in (21) at the critical budget $T = \frac{n}{r}$:

Lemma 6. The probability of successful recovery $P_S(n,r,T,\ell)$ at $T = \frac{n}{r}$ is at most $\frac{3}{4}$ for any $\ell \in \{1, 2, \ldots, r-1\}$.

Such an upper bound allows us to derive a sufficient condition for the optimality of $\ell = r$, by making use of the fact that the recovery probability $P_S(n,r,T,\ell)$ is a nondecreasing function of the budget $T$. The following theorem shows that the choice of $\ell = r$ is optimal when the budget $T$ is at least a specified threshold expressed in terms of $n$ and $r$:



Theorem 7. If

\[
T \ge \frac{n}{r} \left(\frac{3}{4}\right)^{\frac{1}{r}},
\]

then the choice of $\ell = r$, which corresponds to a maximal spreading of the budget, is optimal.

The following corollary states an equivalent result in terms of the achievable recovery probability; it demonstrates the optimality of $\ell = r$ in the high recovery probability regime:

Corollary 4. If a probability of successful recovery of at least $\frac{3}{4}$ is achievable for the given $n$, $r$, and $T$, then the choice of $\ell = r$ is optimal.

Fig. 8 describes the optimal choice of $\ell$ for different values of $r$. We observe that the gap between the threshold of $\frac{3}{4}$ derived in Corollary 4 and the actual critical value of $P_S$ indicated in the plot appears to be no more than 0.12.
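The critical value of $P_S$ at which the curves for $\ell = 1$ and $\ell = r$ intersect can be estimated numerically. The sketch below is our own illustration, not the authors' procedure: writing $t = \frac{T}{n}$, it assumes recovery probabilities $1 - (1-t)^r$ for $\ell = 1$ and $(rt)^r$ for $\ell = r$ on $t \in (0, \frac{1}{r}]$, and locates their crossing by bisection.

```python
def crossing_probability(r, iterations=200):
    """Recovery probability at which ell = 1 and ell = r perform equally.

    Bisection on t = T/n over (0, 1/r]; the gap is negative for small t
    (minimal spreading is better) and positive at t = 1/r.
    """
    def gap(t):
        return (r * t) ** r - (1 - (1 - t) ** r)

    lo, hi = 1e-6, 1.0 / r
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (r * hi) ** r


print(crossing_probability(10))  # compare with the value quoted for Fig. 8
print(crossing_probability(2))
```

For $r = 10$ the computed crossing is close to the value 0.633652 quoted in the caption of Fig. 8; for $r = 2$ the crossing can be solved by hand, giving $t = 0.4$ and $P_S = 0.64$.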



V. Conclusion and Future Work 

We examined the problem of allocating a given total storage 
budget in a distributed storage system for maximum reliability. 
Three variations of the problem were studied in detail, and we 
are able to specify the optimal allocation or optimal symmetric 
allocation for a variety of cases. Although the exact optimal 
allocation is difficult to find in general, our results suggest a 
simple heuristic for achieving reliable storage: when the budget 
is small, spread it minimally; when the budget is large, spread 
it maximally. In other words, coding is unnecessary when the 
budget is small, but is beneficial when the budget is large. 

The work in this paper can be extended in several directions. We can impose additional system design constraints on the model; one practical example is the application of a tighter per-node storage constraint $x_i \le c_i < 1$. The independent probabilistic access model of Section II can be naturally generalized to the case of nonuniform access probabilities $p_i$ for individual nodes. It would also be interesting to find reliable allocations for specific codes with desirable encoding or decoding properties, e.g., sparse codes that offer efficient algorithms (see, e.g., [24]-[27]). A related problem would be to construct such codes that work well under different allocations. Another set of interesting problems involves the application of richer access models; for instance, we can introduce a network topology to a set of storage nodes and data collectors, and allow each data collector to access only the nodes close to it. More generally, we can assign different priorities to each node for data storage and access, so as to reflect the costs of storing data in the node and communicating with it.



Appendix 
Proofs of Theorems 

Proof of Proposition 1: We note that the computational complexity of this problem was well understood in the Berkeley meetings [9] and is by no means a major contribution in this paper. We present the detailed proofs here for completeness.

Consider an allocation $(x_1, \ldots, x_n)$ where each $x_i$ is a nonnegative rational number. The problem of computing the recovery probability of this allocation for the special case of $p = \frac{1}{2}$, for which $p^{|\mathbf{r}|}(1-p)^{n-|\mathbf{r}|} = \left(\frac{1}{2}\right)^n$ for any subset $\mathbf{r} \subseteq \{1, \ldots, n\}$, is equivalent to the counting version of the following decision problem (which happens to be polynomial-time solvable):

Definition. Largest Subset Sum (LSS)
Instance: Finite $n$-vector $(a_1, \ldots, a_n)$ with $a_i \in \mathbb{Z}_0^+$, and file size $d \in \mathbb{Z}^+$, where all $a_i$ and $d$ can be written as decimal numbers of length at most $\ell$.

Question: Is there a subset $\mathbf{r} \subseteq \{1, \ldots, n\}$ that satisfies

\[
\sum_{i \in \mathbf{r}} a_i \ge d\,?
\]

TABLE III: Constructing a #LSS instance for a given #3SAT instance

Note that the allocation and file size have been scaled so 
that the problem parameters are all integers. We will proceed 
to show that the counting problem #LSS is #P-complete; 
this would in turn establish the #P-hardness of computing the 
recovery probability for an arbitrary value of p. 

The index set $\mathbf{r}$ can be represented as an $n$-vector of bits. Using this representation of $\mathbf{r}$ as the certificate, it is easy to see that the binary relation corresponding to #LSS is both polynomially balanced (since the size of each certificate is $n$), and polynomial-time decidable (since the inequality can be verified in $O(n\ell)$ time for each certificate). It therefore follows that #LSS is in #P.

To show that #LSS is also #P-hard, we describe a polynomial-time Turing reduction of the #P-complete problem #3SAT [28] to #LSS. Our approach is similar to the standard method of reducing 3SAT to SUBSET SUM (see, e.g., [29]). Let $\phi$ be the Boolean formula in the given #3SAT instance; denote its $m$ variables by $v_1, \ldots, v_m$, and $k$ clauses by $C_1, \ldots, C_k$. To count the number of satisfying truth assignments for $\phi$, we construct a #LSS instance with the help of Table III, whose entries are 0, 1, 2, or 3 (all blank entries are 0's). The entries of the $n$-vector for the #LSS instance are given by the first $(2m + 3k)$ rows of the table; the file size $d$ is given by the last row of the table. Each entry $a_i$, $i \in \{1, \ldots, 2m + 3k\}$, as well as $d$, is a positive integer with at most $(m + 2k)$ decimal digits. Observe that the set of satisfying truth assignments for $\phi$ can be put in a one-to-one correspondence with the collection of subsets $\mathbf{r} \subseteq \{1, \ldots, 2m + 3k\}$ that satisfy $\sum_{i \in \mathbf{r}} a_i = d$; for each $i \in \{1, \ldots, m\}$, we have "$v_i$" $\in \mathbf{r}$ if and only if $v_i = \text{TRUE}$, and "$\bar{v}_i$" $\in \mathbf{r}$ if and only if $v_i = \text{FALSE}$. Therefore, if $f\big((a_1, \ldots, a_n), d\big)$ is a subroutine for computing #LSS, then the number of satisfying truth assignments can be computed by calling $f$ twice: first with $d$ taking the value as prescribed above, and second with $d$ taking the prescribed value plus one. The difference between the outputs from the two subroutine calls is equal to the number of distinct subsets $\mathbf{r}$ that satisfy $\sum_{i \in \mathbf{r}} a_i = d$, which is equal to the number of satisfying truth assignments for $\phi$. Finally, we note that this is indeed a polynomial-time Turing reduction since the table can be populated in $O(m^2 k^2)$ simple steps, and the subroutine $f$ is called exactly twice. ■
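The "call the subroutine twice" step of the reduction can be illustrated on a toy instance. In the sketch below, `count_lss` is a brute-force stand-in for the #LSS subroutine $f$ (exponential time, for illustration only), and `count_exact` recovers the number of subsets with sum exactly $d$ as the difference of two calls. Both names are ours.

```python
from itertools import combinations


def count_lss(a, d):
    """Number of index subsets r with sum of a[i] over r at least d (#LSS)."""
    n = len(a)
    return sum(
        1
        for size in range(n + 1)
        for subset in combinations(range(n), size)
        if sum(a[i] for i in subset) >= d
    )


def count_exact(a, d):
    """Number of subsets with sum exactly d, using two #LSS calls."""
    return count_lss(a, d) - count_lss(a, d + 1)


# Toy instance: the subsets of (1, 2, 3) summing to exactly 3 are {3} and {1, 2}.
print(count_lss([1, 2, 3], 3), count_lss([1, 2, 3], 4), count_exact([1, 2, 3], 3))
```

The difference works because all entries and $d$ are integers, so "sum at least $d$" minus "sum at least $d + 1$" counts exactly the subsets whose sum equals $d$.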
Proof of Lemma 1: Consider a feasible allocation $(x_1, \ldots, x_n)$; we have $\sum_{i=1}^{n} x_i \le T$, where $x_i \ge 0$, $i = 1, \ldots, n$. Let $S_r$ denote the number of $r$-subsets of $\{x_1, \ldots, x_n\}$ that have a sum of at least 1, where $r \in \{1, \ldots, n\}$. By conditioning on the number of nodes accessed by the data collector, the probability of successful recovery for this allocation can be written as

\[
\begin{aligned}
\mathbb{P}[\text{successful recovery}]
&= \sum_{r=1}^{n} \mathbb{P}[\text{successful recovery} \mid \text{exactly } r \text{ nodes were accessed}] \cdot \mathbb{P}[\text{exactly } r \text{ nodes were accessed}] \\
&= \sum_{r=1}^{n} \frac{S_r}{\binom{n}{r}} \cdot \mathbb{P}[\mathrm{B}(n,p) = r].
\end{aligned} \tag{22}
\]

We proceed to find an upper bound for $S_r$. For a given $r$, we can write $S_r$ inequalities of the form

\[
x'_1 + \cdots + x'_r \ge 1.
\]

Summing up these $S_r$ inequalities produces an inequality of the form

\[
a_1 x_1 + \cdots + a_n x_n \ge S_r.
\]

Since each $x_i$ belongs to exactly $\binom{n-1}{r-1}$ distinct $r$-subsets of $\{x_1, \ldots, x_n\}$, it follows that $0 \le a_i \le \binom{n-1}{r-1}$, $i = 1, \ldots, n$. Therefore,

\[
S_r \le a_1 x_1 + \cdots + a_n x_n \le \binom{n-1}{r-1} \sum_{i=1}^{n} x_i \le \binom{n-1}{r-1} T.
\]

Since $S_r$ is also at most $\binom{n}{r}$, i.e., the total number of $r$-subsets, we have

\[
S_r \le \min\left( \binom{n-1}{r-1} T,\; \binom{n}{r} \right).
\]

Substituting this bound into (22) completes the proof. ■
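The counting bound used in this proof can be checked numerically on a small allocation. The sketch below assumes the bound $S_r \le \min\left(\binom{n-1}{r-1} T, \binom{n}{r}\right)$ and uses hypothetical helper names; it compares the bound with a brute-force count for every $r$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_successful_subsets(x, r):
    """S_r: number of r-subsets of the allocation x with sum at least 1."""
    return sum(1 for subset in combinations(x, r) if sum(subset) >= 1)


def lemma1_count_bound(n, r, T):
    """Upper bound on S_r for any allocation of total budget at most T."""
    return min(comb(n - 1, r - 1) * T, comb(n, r))


# Example: n = 5 nodes, four of which store half a unit each (T = 2).
x = [Fraction(1, 2)] * 4 + [Fraction(0)]
T = sum(x)
for r in range(1, len(x) + 1):
    print(r, count_successful_subsets(x, r), lemma1_count_bound(len(x), r, T))
```

In this example the bound is met with equality for $r \ge 3$ and is loose for $r \le 2$, which illustrates that the bound is not tight in general.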
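Proof of Theorem 1: The suboptimality gap for the symmetric allocation $\bar{\mathbf{x}}(n,T,m{=}n)$ is at most the difference between its recovery probability and the upper bound from Lemma 1 for the optimal recovery probability. This difference is given by

\[
\begin{aligned}
\sum_{r=1}^{\lceil n/T \rceil - 1} \frac{rT}{n} \binom{n}{r} p^r (1-p)^{n-r}
&= T \sum_{r=1}^{\lceil n/T \rceil - 1} \binom{n-1}{r-1} p^r (1-p)^{n-r} \\
&= pT \sum_{r=1}^{\lceil n/T \rceil - 1} \binom{n-1}{r-1} p^{r-1} (1-p)^{(n-1)-(r-1)} \\
&= pT \sum_{\ell=0}^{\lceil n/T \rceil - 2} \binom{n-1}{\ell} p^{\ell} (1-p)^{(n-1)-\ell} \\
&= pT\, \mathbb{P}\left[ \mathrm{B}(n-1,p) \le \left\lceil \tfrac{n}{T} \right\rceil - 2 \right] \triangleq \delta(n,p,T),
\end{aligned}
\]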

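as required. Assuming now that $p > \frac{1}{T}$, we have

\[
\begin{aligned}
\delta(n,p,T) &\le pT\, \mathbb{P}\left[ \mathrm{B}(n-1,p) \le \frac{n}{T} - 1 \right] && (23)\\
&= pT\, \mathbb{P}\left[ \mathrm{B}(n-1,p) \le \frac{n-T}{(n-1)pT} \cdot (n-1)p \right] \\
&\le pT \exp\left( -\frac{(n-1)p}{2} \left( 1 - \frac{n-T}{(n-1)pT} \right)^{2} \right). && (24)
\end{aligned}
\]

Inequality (23) follows from the fact that $\left\lceil \frac{n}{T} \right\rceil - 2 < \frac{n}{T} - 1$. Inequality (24) follows from the observation that $\frac{n-T}{(n-1)pT} \in (0, 1)$, and the subsequent application of the Chernoff bound for deviation below the mean of the binomial distribution (see, e.g., [30]). For fixed $p$ and $T$, this upper bound approaches zero as $n$ goes to infinity. ■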
Proof of Theorem 2: We compare the recovery probability of $\bar{\mathbf{x}}(n,T,m{=}\lfloor T \rfloor)$ to an upper bound for the recovery probabilities of all other allocations.

Suppose that $1 < T < n$. Recall that the probability of successful recovery for $\bar{\mathbf{x}}(n,T,m{=}\lfloor T \rfloor)$ is given by

\[
P_1(p,T) \triangleq 1 - (1-p)^{\lfloor T \rfloor}.
\]

Consider a feasible allocation $(x_1, \ldots, x_n)$; we have $\sum_{i=1}^{n} x_i \le T$, where $x_i \ge 0$, $i = 1, \ldots, n$. Let $\ell$ be the number of individual nodes in this allocation that store at least a unit amount of data; for brevity, we refer to these nodes as being complete. It follows from the budget constraint that the number of complete nodes $\ell \in \{0, 1, \ldots, \lfloor T \rfloor\}$. When $\ell = \lfloor T \rfloor$, the allocation has a recovery probability identical to $P_1(p,T)$. Now, assuming that $\ell \in \{0, 1, \ldots, \lfloor T \rfloor - 1\}$, successful recovery can occur in two ways:

(i) when the accessed subset contains one or more complete nodes, which occurs with probability $1 - (1-p)^{\ell}$, or

(ii) when the accessed subset contains no complete nodes but has a sum of at least 1.

In case (ii), the accessed subset would consist of two or more incomplete nodes. Using the argument in the proof of Lemma 1, we can show that there are at most

\[
\min\left( \binom{n-\ell-1}{r-1} (T - \ell),\; \binom{n-\ell}{r} \right)
\]

$r$-subsets of incomplete nodes whose sum is at least 1, since the total amount of data stored over the $n - \ell$ incomplete nodes is at most $T - \ell$. It follows then that the recovery probability for a feasible allocation with exactly $\ell \in \{0, 1, \ldots, \lfloor T \rfloor - 1\}$ complete nodes is at most

\[
P_2(n,p,T,\ell) \triangleq 1 - (1-p)^{\ell} + (1-p)^{\ell} \sum_{r=2}^{n-\ell} \min\left( \binom{n-\ell-1}{r-1} (T - \ell),\; \binom{n-\ell}{r} \right) p^r (1-p)^{n-\ell-r}.
\]

Thus,

\[
P_1(p,T) \ge P_2(n,p,T,\ell)
\]

for all $\ell \in \{0, 1, \ldots, \lfloor T \rfloor - 1\}$ is a sufficient condition for $\bar{\mathbf{x}}(n,T,m{=}\lfloor T \rfloor)$ to be an optimal allocation. After further simplification of this inequality, we arrive at inequality (8) as required. ■
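The sufficient condition in this proof can be evaluated numerically for given parameters. The sketch below is an illustration under the assumption that $P_1(p,T) = 1 - (1-p)^{\lfloor T \rfloor}$ and that $P_2(n,p,T,\ell)$ bounds the number of successful $r$-subsets of incomplete nodes by $\min\left(\binom{n-\ell-1}{r-1}(T-\ell), \binom{n-\ell}{r}\right)$; the function names are ours.

```python
from math import comb, floor


def p1(p, T):
    """Recovery probability of minimal spreading, m = floor(T)."""
    return 1 - (1 - p) ** floor(T)


def p2(n, p, T, ell):
    """Upper bound for allocations with exactly ell complete nodes."""
    m = n - ell  # number of incomplete nodes
    tail = sum(
        min(comb(m - 1, r - 1) * (T - ell), comb(m, r)) * p**r * (1 - p) ** (m - r)
        for r in range(2, m + 1)
    )
    return 1 - (1 - p) ** ell + (1 - p) ** ell * tail


def minimal_spreading_certified(n, p, T):
    """True if P1 >= P2 for every ell in {0, ..., floor(T) - 1}."""
    return all(p1(p, T) >= p2(n, p, T, ell) for ell in range(floor(T)))


print(minimal_spreading_certified(5, 0.1, 2.5))  # small access probability
print(minimal_spreading_certified(5, 0.9, 2.5))  # large access probability
```

Because the condition is only sufficient, a result of `False` does not by itself show that minimal spreading is suboptimal; it only means that this particular certificate is unavailable.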
Proof of Corollary 1: Suppose that $1 < T < n$. We will show that the sufficient condition of Theorem 2 is satisfied for any $p \le \frac{2}{(n - \lfloor T \rfloor)^2}$. Note that when $n - \lfloor T \rfloor = 1$, or equivalently $T \in [n-1, n)$, we have to show that $\bar{\mathbf{x}}(n,T,m{=}\lfloor T \rfloor)$ is an optimal allocation for any $p$, i.e., in the interval $(0, 1)$.

First, observe that the summation term in inequality <[8J is 
always nonnegative, i.e., 



E 

r=1 



1 - 



T — £ 



n-£ 
r 



1-p 



>0, 



since 
{0,1, 

r < 



for any r G |2, . 
. . , |TJ - 1}, we have 

'n-l' 



\\ and £ 



T -£ 



1 



r < 



T — £ 



1 



T — £ 
n-£ 



r > 0. 



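Therefore, a simpler but weaker sufficient condition for $\bar{\mathbf{x}}(n,T,m{=}\lfloor T \rfloor)$ to be an optimal allocation is

\[
1 - (1-p)^{\lfloor T \rfloor - n} + \big(n - (\lfloor T \rfloor - 1)\big) \left( \frac{p}{1-p} \right) \ge 0
\iff 1 + (n - \lfloor T \rfloor)\, p - (1-p)^{1 - (n - \lfloor T \rfloor)} \ge 0,
\]

which is an inequality in only two variables $p$ and $s \triangleq n - \lfloor T \rfloor$, where $s \in \{1, \ldots, n-1\}$. When $s = 1$, or equivalently $T \in [n-1, n)$, this inequality is satisfied for any $p \in (0, 1)$, as required. Defining the function

\[
f(s,p) \triangleq 1 + sp - (1-p)^{1-s},
\]

it suffices to show that $f(s,p) \ge 0$ for any $s \in \mathbb{Z}^+$, $s \ge 2$, and $p \in \left(0, \frac{2}{s^2}\right]$. We do this by demonstrating that for any $s \in \mathbb{Z}^+$, $s \ge 2$, the function $f(s,p)$ is concave in $p$ on the interval $p \in \left(0, \frac{2}{s^2}\right]$, and is nonnegative at both endpoints, i.e.,

\[
f(s, p{=}0) \ge 0 \quad \text{and} \quad f\left(s, p{=}\tfrac{2}{s^2}\right) \ge 0.
\]

The second-order partial derivative of $f(s,p)$ wrt $p$ is given by

\[
\frac{\partial^2}{\partial p^2} f(s,p) = -s(s-1)(1-p)^{-s-1}.
\]

Since $\frac{\partial^2}{\partial p^2} f(s,p) \le 0$ for any $s \in \mathbb{Z}^+$, $s \ge 2$, and $p \in \left(0, \frac{2}{s^2}\right]$, it follows that the function $f(s,p)$ is concave in $p$ on the interval $p \in \left(0, \frac{2}{s^2}\right]$ for any $s \in \mathbb{Z}^+$, $s \ge 2$.

Suppose that $s \in \mathbb{Z}^+$, $s \ge 2$. Clearly, $f(s, p{=}0) = 0$. To show that $f\left(s, p{=}\frac{2}{s^2}\right) \ge 0$, we define the function

\[
g(s) \triangleq \ln\left(1 + \frac{2}{s}\right) + (s-1) \ln\left(1 - \frac{2}{s^2}\right),
\]

and show that $g(s) \ge 0$ for any $s \in \mathbb{Z}^+$, $s \ge 2$. Direct evaluation of the function gives us $g(s{=}2) = 0$, and $g(s{=}3) = \ln\frac{5}{3} - 2\ln\frac{9}{7} > 0$. For $s \ge 4$, we consider the derivatives of $g(s)$:

\[
\begin{aligned}
g'(s) &= \frac{1}{s} + \frac{1}{s+2} - \frac{2(s-2)}{s^2 - 2} + \ln\left(1 - \frac{2}{s^2}\right), \\
g''(s) &= \frac{8\left(s^3 - s^2 - 6s - 2\right)}{s^2 (s+2)^2 (s^2-2)^2}.
\end{aligned}
\]

Since $g''(s) > 0$ for any $s \ge 4$, and $\lim_{s \to \infty} g'(s) = 0$, it follows that $g'(s) < 0$ for any $s \ge 4$. Now, since $g'(s) < 0$ for any $s \ge 4$, and $\lim_{s \to \infty} g(s) = 0$, it follows that $g(s) > 0$ for any $s \ge 4$. Therefore, for any $s \in \mathbb{Z}^+$, $s \ge 2$, we have

\[
\ln\left(1 + \frac{2}{s}\right) + (s-1) \ln\left(1 - \frac{2}{s^2}\right) \ge 0
\implies 1 + \frac{2}{s} - \left(1 - \frac{2}{s^2}\right)^{1-s} \ge 0
\implies f\left(s, p{=}\tfrac{2}{s^2}\right) \ge 0,
\]

as required. ■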
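The endpoint inequality at the heart of this proof can be spot-checked numerically. The sketch below assumes $f(s,p) = 1 + sp - (1-p)^{1-s}$ and $g(s) = \ln(1 + \frac{2}{s}) + (s-1)\ln(1 - \frac{2}{s^2})$ as above; it is a sanity check over a finite range of $s$, not a substitute for the analytical argument.

```python
from math import log


def f(s, p):
    """f(s, p) = 1 + s p - (1 - p)^(1 - s)."""
    return 1 + s * p - (1 - p) ** (1 - s)


def g(s):
    """Logarithmic form of the endpoint condition f(s, 2/s^2) >= 0."""
    return log(1 + 2 / s) + (s - 1) * log(1 - 2 / s**2)


# Check the right endpoint p = 2/s^2 and a grid of interior points.
for s in range(2, 50):
    p_max = 2 / s**2
    assert f(s, p_max) >= -1e-12
    assert all(f(s, p_max * j / 20) >= -1e-12 for j in range(1, 21))

print(g(2), g(3), g(4))
```

The small negative tolerance only absorbs floating-point rounding near $s = 2$, where the endpoint value is exactly zero.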
Proof of Theorem 3: We will show that if condition (11) is satisfied, then $\Delta(p,T,k) \ge 0$ for any $k \in \mathbb{Z}^+$. First, we note that



1 

IrJ 



(26) 



Inequality d25T l follows from the fact that 

[kr\ < kr <k <^ L fcr J <k-l L fcr l - fc + 1 < 0. 

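Now, if condition (11) is satisfied, then we necessarily have $T \ge 2$; otherwise, $T \in [1,2)$ would imply that $\lfloor T \rfloor = 1$, which produces $(1-p)^{\lfloor T \rfloor} + 2 \lfloor T \rfloor p (1-p)^{\lfloor T \rfloor - 1} - 1 = p > 0$, contradicting our assumption. It follows that

\[
\begin{aligned}
& (1-p)^{\lfloor T \rfloor} + 2 \lfloor T \rfloor p (1-p)^{\lfloor T \rfloor - 1} - 1 \le 0 \\
\iff\; & \mathbb{P}[\mathrm{B}(\lfloor T \rfloor, p) = 0] + 2\, \mathbb{P}[\mathrm{B}(\lfloor T \rfloor, p) = 1] - 1 \le 0 \\
\iff\; & \mathbb{P}[\mathrm{B}(\lfloor T \rfloor, p) \ge 2] \ge \mathbb{P}[\mathrm{B}(\lfloor T \rfloor, p) = 1]
\end{aligned}
\]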

[TJ 

E ( lT / ) ^ - f) LrJ_j ^ Lrjp(i - p)^" 1 



J=2 



i-p 



i-p 



> 1. 



(27) 
(28) 



Observe that a KT = [(k + 1)TJ - LfcTj £ {[TJ, \T}}, be- 
cause ctfc.T £ (T — 1, T + l) and there are only two integers 
[TJ and [T], which are possibly nondistinct, in this interval. 
It follows from 427) and (|28) that 



\ - 1 / ak,r 



i-p 



> i. 



(29) 



Therefore, we have 

min(afc,T — l,fc) «fc.T 

E E 



J I ak,T 



P 



-i+j 



1 Qfc.T 

^EE 



[kT\\ 

k-i I / CKk.T 



LfcTjA 



P 



1-p 



' ^ fe_i J I ak T 

h ( L \ TJ ) v j 



i-p 



(30) 



> 



E Pn 

J=2 L J 



3 

> 1, from 



1-p 



1-p 



from ( f26b 



fc-i J 



[kT\ -k + 1 

k 



[k([T\ 



•r)J 



Inequality ( f30l > follows from the fact that 

min(afe ! T — 1, fc) > min(2— 1,1) = 1. 
where r = T - |TJ £ [0, 1) Consequently, 



> 



k 

k\T] 



Lfcrj 



E E f i) 

i=l 3=i+l v 



«y v i 



(25) 



A(p,T,fc) > 0, from CG3. 
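It follows that

\[
P_S(p,T,m{=}\lfloor T \rfloor) \le P_S(p,T,m{=}\lfloor 2T \rfloor) \le \cdots \le P_S\left(p,T,m{=}\left\lfloor \left\lfloor \tfrac{n}{T} \right\rfloor T \right\rfloor\right),
\]

and so we conclude that an optimal $m^*$ is given by either $m = \left\lfloor \left\lfloor \frac{n}{T} \right\rfloor T \right\rfloor$ or $m = n$. ■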

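Proof of Corollary 2: If $p \ge \frac{4}{3 \lfloor T \rfloor}$, then we necessarily have $T \ge 2$; otherwise, $T \in [1,2)$ would imply that $\lfloor T \rfloor = 1$, which produces $p \ge \frac{4}{3 \lfloor T \rfloor} = \frac{4}{3}$, contradicting the definition of $p$. We will show that condition (11) of Theorem 3 is satisfied for any $T \ge 2$ and $p \ge \frac{4}{3 \lfloor T \rfloor}$. To do this, we define the function

\[
f(p,T) \triangleq (1-p)^{\lfloor T \rfloor} + 2 \lfloor T \rfloor p (1-p)^{\lfloor T \rfloor - 1} - 1,
\]

and show that $f(p,T) \le f\left(p{=}\frac{4}{3 \lfloor T \rfloor}, T\right) \le 0$ for any $T \ge 2$ and $p \ge \frac{4}{3 \lfloor T \rfloor}$.

The partial derivative of $f(p,T)$ wrt $p$ is given by

\[
\frac{\partial}{\partial p} f(p,T) = \lfloor T \rfloor (1-p)^{\lfloor T \rfloor - 2} \left(1 + p - 2 \lfloor T \rfloor p\right).
\]

Observe that $f(p,T)$ is decreasing wrt $p$ for any $T \ge 2$ and $p \ge \frac{4}{3 \lfloor T \rfloor}$, since

\[
p \ge \frac{4}{3 \lfloor T \rfloor} > \frac{1}{\lfloor T \rfloor} > \frac{1}{2 \lfloor T \rfloor - 1}
\implies 2 \lfloor T \rfloor p - p > 1
\implies 1 + p - 2 \lfloor T \rfloor p < 0
\implies \frac{\partial}{\partial p} f(p,T) < 0.
\]

Now, consider the function

\[
g(T) \triangleq f\left(p{=}\tfrac{4}{3 \lfloor T \rfloor}, T\right) = \left(1 - \frac{4}{3 \lfloor T \rfloor}\right)^{\lfloor T \rfloor - 1} \left(\frac{11}{3} - \frac{4}{3 \lfloor T \rfloor}\right) - 1.
\]

We will proceed to show that $g(T) \le 0$ for any $T \ge 2$. For $T \in [2,3)$, we have $\lfloor T \rfloor = 2$ and $g(T) = 0$. To show that $g(T) < 0$ for any $T \ge 3$, we consider the function

\[
h(T) \triangleq (T-1) \ln\left(1 - \frac{4}{3T}\right) + \ln\left(\frac{11}{3} - \frac{4}{3T}\right),
\]

which has the derivatives

\[
\begin{aligned}
h'(T) &= \ln\left(1 - \frac{4}{3T}\right) + \frac{1}{3T-4} + \frac{11}{11T-4}, \\
h''(T) &= \frac{16\left(11T^2 - 24T - 16\right)}{T \left(33T^2 - 56T + 16\right)^2}.
\end{aligned}
\]

Since $h''(T) > 0$ for any $T \ge 3$, and $\lim_{T \to \infty} h'(T) = 0$, it follows that $h'(T) < 0$ for any $T \ge 3$. Now, since $h'(T) < 0$ for any $T \ge 3$, and $h(T{=}3) = \ln\frac{29}{9} - 2\ln\frac{9}{5} < 0$, it follows that $h(T) < 0$ for any $T \ge 3$. Thus, for any $T \ge 3$, we have

\[
(\lfloor T \rfloor - 1) \ln\left(1 - \frac{4}{3 \lfloor T \rfloor}\right) + \ln\left(\frac{11}{3} - \frac{4}{3 \lfloor T \rfloor}\right) < 0
\implies \left(1 - \frac{4}{3 \lfloor T \rfloor}\right)^{\lfloor T \rfloor - 1} \left(\frac{11}{3} - \frac{4}{3 \lfloor T \rfloor}\right) < 1
\implies g(T) < 0.
\]

Combining these results, we obtain

\[
f(p,T) \le f\left(p{=}\tfrac{4}{3 \lfloor T \rfloor}, T\right) = g(T) \le 0
\]

for any $T \ge 2$ and $p \ge \frac{4}{3 \lfloor T \rfloor}$, as required. ■

Proof of Lemma 2: Suppose that $T > 1$. We will show that if condition (12) or condition (13) is satisfied, then $\Delta(p,T,k) \le 0$ for any $k \in \mathbb{Z}^+$. First, we note that for any $i \in \{1, \ldots, k\}$,

\[
\frac{\binom{\lfloor kT \rfloor}{k-i}}{\binom{\lfloor kT \rfloor}{k}}
= \frac{\overbrace{k (k-1) \cdots (k-i+1)}^{i \text{ terms}}}{\underbrace{(\lfloor kT \rfloor - k + i) \cdots (\lfloor kT \rfloor - k + 2)(\lfloor kT \rfloor - k + 1)}_{i \text{ terms}}}
\le \left( \frac{k}{\lfloor kT \rfloor - k + 1} \right)^{i}
< \left( \frac{k}{kT - 1 - k + 1} \right)^{i}
= \left( \frac{1}{T-1} \right)^{i}. \tag{31}
\]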
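The binomial-ratio bound $\binom{\lfloor kT \rfloor}{k-i} / \binom{\lfloor kT \rfloor}{k} < \left(\frac{1}{T-1}\right)^{i}$ used at the start of the proof of Lemma 2 can be spot-checked with exact arithmetic. The sketch below uses hypothetical helper names and rational values of $T > 1$.

```python
from fractions import Fraction
from math import comb, floor


def binomial_ratio(k, i, T):
    """C(floor(kT), k - i) / C(floor(kT), k) as an exact fraction."""
    N = floor(k * T)
    return Fraction(comb(N, k - i), comb(N, k))


def ratio_bound(i, T):
    """The bound (1 / (T - 1))^i, valid for T > 1."""
    return (1 / (T - 1)) ** i


# Spot check over a few rational budgets T > 1.
for T in (Fraction(3, 2), Fraction(5, 2), Fraction(7, 3)):
    for k in range(1, 7):
        for i in range(1, k + 1):
            assert binomial_ratio(k, i, T) < ratio_bound(i, T)

print(binomial_ratio(2, 1, Fraction(5, 2)), ratio_bound(1, Fraction(5, 2)))
```

For $T = \frac{5}{2}$, $k = 2$, $i = 1$ the ratio is $\frac{1}{2}$ against a bound of $\frac{2}{3}$, showing that the inequality is strict but not far from tight for small parameters.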

Now, if condition ( fT2l is satisfied, then 

i=l j=i+l v \ J 

T-l T , 1 

= H ( T- 1 

i=l x 

T-l T /mX , 1 



(3D 



1-p 



-t+3 



T 



1-i 



5 E.a 



j=l j=i + l 
T 

£=2 



T- 1 
1 

T - 1 



= 1. 



On the other hand, if condition ( fT3l is satisfied, then 

E E ^ 

i— 1 j=i-(-l 
[Tl-l [T] 

- E E 

i=l j=i+l 

m n-i 



T-l J \ 3 J \1-P 
\T] \ ( 1-P 



j J \ P (T-l)J \l-p 
1 -p 



i-p 



E ( E v 

1=1 \r=l 

r(i,(i_ i)m-i_ p(1 _ p) m-i^ 
(l-prjCi-^^-^i-pjm-i " L 

Thus, if either condition is satisfied, we have 

?S(^KT)(^r-» 

^^'gf^YmV^r^i. (33, 



z— 1 j=i+l 



T-l 



1-p 



As in the proof of Theorem [3j we note that ah t — 
l(k + 1)TJ - [fcTj e {LTJ, [T]}. It follows from (El' and 
that 



E E 

i=i j=i+i 



T-i J V i 



i-p 



-i+3 



< I- (34) 
Therefore, we have

min(afc,T-l,fc) Qfc.T ( l kT i ] , \ / 

y- \ - V k-i J I a k . T ) ( P 

min(a fciT -l,fc) a kiT . j . . . 

< £ e(^i)(T)(i^) 



< 



< 1, from ([341 , 
Consequently, 

min(a k/r ~l,k) a k T 

E E 

i=l j=i+l 



LfcTj 
k — i 



a.k,T 
J 



-i+3 



<-{ lk V) 



<*=S> A(p, P, k) < 0, from (fT0>. 

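It follows that

\[
P_S(p,T,m{=}\lfloor T \rfloor) \ge P_S(p,T,m{=}\lfloor 2T \rfloor) \ge P_S(p,T,m{=}\lfloor 3T \rfloor) \ge \cdots,
\]

and since

\[
P_S(p,T,m{=}n)
\begin{cases}
= P_S\left(p,T,m{=}\left\lfloor \left\lfloor \tfrac{n}{T} \right\rfloor T \right\rfloor\right) & \text{if } \tfrac{n}{T} \in \mathbb{Z}^+, \\
\le P_S\left(p,T,m{=}\left\lfloor \left( \left\lfloor \tfrac{n}{T} \right\rfloor + 1 \right) T \right\rfloor\right) & \text{otherwise,}
\end{cases}
\]

we conclude that an optimal $m^*$ is given by $m = \lfloor T \rfloor$. ■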
Proof of Lemma [5} Since x (n, T, m= |_PJ ) is indeed 
optimal for any p when T = 1, we need only consider the 
case of T > 1. We will show that either condition dT2l > or 
condition ( TT3b of Lemma [2] is satisfied for any T > 1 and 
P < yyj — y ■ We do this in two steps: First, we define the 
function 

p(l-p)^- 1 , 



- (I - - 

TV T 



and show that /(p,T) < / (p=yfy-i,TJ < for any 

T > 1 and p < -h^t — ^. Second, we apply the appropriate 
condition from Lemma [2] for each pair of T and p. 
The partial derivative of /(p, P) wrt p is given by 

9 f{p)T)= (l-p\T])(l-p)^- 2 



Observe that /(p, P) is nondecreasing wrt p for any T > 1 
and p < r|rr — ^, since 



< 



[P 

p[ri < i 



T ~ \T] \T] \T\ 



l-p\T] >0^— /(p,T)>0. 



Now, consider the function 



We will proceed to show that g(T) < for any T > 1 
by reparameterizing g(T) as h(c,r), where c= [~T] and 
t = |"T] — T: 

\c-l 

1-2 



/i(c,r) =g{T=c-r) = 



l 



1 

c— r 



- 1. 



1 

C — T 



1 



1 

C — T 



The partial derivative of h(c, r) wrt r is given by 



d_ 



h(c,r) 



2r\c-2) + ^ 



(c(c-l-r) + 2r) 2 (l--i-) 



Since $\frac{\partial}{\partial \tau} h(c,\tau) \le 0$ for any $c \in \mathbb{Z}^+$, $c \ge 2$, and $\tau \in [0,1)$, it
follows that for any $T > 1$, we have

. 9 (p) = / l ( C =rpi,r=rri-r) 

< /i(c=[P],r=0) 



rp) ( 



2 

m 



i 



[Tl-l 



fTi-i 



-1=0. 



m (- 1 m) 

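Combining these results, we obtain

$$f(p,T) \le f\left(p{=}\tfrac{2}{\lceil T \rceil} - \tfrac{1}{T},\, T\right) = g(T) \le 0$$

for any $T > 1$ and $p \le \frac{2}{\lceil T \rceil} - \frac{1}{T}$, which implies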

P(i-p) m " 



■4H) 



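Finally, we apply the appropriate condition from Lemma 2
for each pair of $T$ and $p$. For $T \in \mathbb{Z}^+$, $T > 1$, we
have $\frac{2}{\lceil T \rceil} - \frac{1}{T} = \frac{1}{T}$: we use condition (12) for $p = \frac{1}{T}$, and
condition (13) for $p < \frac{1}{T}$. For $T \notin \mathbb{Z}^+$, $T > 1$, we have
$\frac{2}{\lceil T \rceil} - \frac{1}{T} < \frac{1}{T}$: we use condition (13) for $p < \frac{1}{T}$. ■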
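Proof of Theorem 4: Since $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is indeed
optimal for any $p$ when $T = 1$, we need only consider the case
of $T > 1$. We will show that $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is an optimal
symmetric allocation for any $T > 1$ and $p < \frac{1}{\lceil T \rceil}$. We do this
by considering subintervals of $T$ over which $\lceil T \rceil$ is constant.

Let $T$ be confined to the unit interval $(c, c+1]$, where
$c \in \mathbb{Z}^+$. According to Lemma 3, $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is optimal
for any

$$p \in \left(0,\ \frac{2}{c+1} - \frac{1}{T}\right] \quad \text{and} \quad T \in (c, c+1],$$

or equivalently, for any

$$p \in \left(0,\ \frac{1}{c+1}\right] \quad \text{and} \quad T \in \left[\frac{1}{\frac{2}{c+1} - p},\ \infty\right) \cap (c, c+1].$$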



This is just the area below a "peak" in Fig. 4 expressed
in terms of different independent variables. For each
$p \in \left(0, \frac{1}{c+1}\right)$, we can always find a $T_0$ such that

$$T_0 \in \left[\frac{1}{\frac{2}{c+1} - p},\ c+1\right) \cap (c, c+1).$$



For example, we can pick Pq = c 
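Now, we make the crucial observation that if $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$
is an optimal symmetric allocation for $T = T_0$, then
$\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is also an optimal symmetric allocation for
any $T \in [\lfloor T_0 \rfloor, T_0]$. This claim can be proven by contradiction:
the recovery probability for $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is given
by

$$P_S(p, T, m{=}\lfloor T \rfloor) = \mathbb{P}\left[\mathrm{B}(\lfloor T \rfloor, p) \ge 1\right],$$

which remains constant for all $T \in [\lfloor T_0 \rfloor, T_0]$, and a
symmetric allocation that performs strictly better than
$\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ for some $T \in [\lfloor T_0 \rfloor, T_0]$ would therefore
also outperform $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ for $T = T_0$. Since
$\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is indeed optimal for our choice of $T_0$, it
follows then that $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is also optimal for any

$$p \in \left(0,\ \frac{1}{c+1}\right) \quad \text{and} \quad T \in (c, c+1].$$

By applying this result for each $c \in \mathbb{Z}^+$, we reach the conclusion
that $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is an optimal symmetric allocation
for any $T > 1$ and $p < \frac{1}{\lceil T \rceil}$.

Finally, to extend the optimality of $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ to
$p = \frac{1}{\lceil T \rceil}$, we note that the recovery probability $P_S(p, T, m)$
$= \mathbb{P}\left[\mathrm{B}(m, p) \ge \left\lceil \frac{m}{T} \right\rceil\right]$ is a polynomial in $p$ and is therefore
continuous at $p = \frac{1}{\lceil T \rceil}$. Since $\mathbf{x}(n, T, m{=}\lfloor T \rfloor)$ is optimal as
$p \to \frac{1}{\lceil T \rceil}$ from below, it remains optimal at $p = \frac{1}{\lceil T \rceil}$. ■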
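
The comparison between symmetric allocations can be explored numerically. The following is a minimal sketch, not part of the original proof: it evaluates the recovery probability $\mathbb{P}[\mathrm{B}(m,p) \ge \lceil m/T \rceil]$ of a symmetric allocation over $m$ nonempty nodes in exact rational arithmetic and searches over $m$. The function names and the example parameters ($n = 10$, $T = 5/2$, $p = 3/10$) are assumptions chosen purely for illustration.

```python
from fractions import Fraction
from math import ceil, comb


def recovery_probability(p, T, m):
    """Return P[B(m, p) >= ceil(m / T)] for a symmetric allocation over m nodes."""
    p, T = Fraction(p), Fraction(T)
    k = ceil(Fraction(m) / T)  # number of nonempty nodes that must be accessed
    return sum(comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(k, m + 1))


def best_symmetric_m(n, p, T):
    """Return an m in {1, ..., n} that maximizes the recovery probability."""
    return max(range(1, n + 1), key=lambda m: recovery_probability(p, T, m))


if __name__ == "__main__":
    # Illustrative parameters with p <= 1 / ceil(T) = 1/3.
    p, T = Fraction(3, 10), Fraction(5, 2)
    print(best_symmetric_m(10, p, T), recovery_probability(p, T, 2))
```

Exact fractions are used so that ties between candidate values of $m$ are not decided by floating-point rounding.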

Proof of Proposition 2: Consider an allocation
$(x_1, \ldots, x_n)$ where each $x_i$ is a nonnegative rational number.
The problem of computing the recovery probability for this
allocation and a given subset size $r$ is equivalent to the
counting version of the following decision problem (which
happens to be polynomial-time solvable):

Definition. Largest $r$-Subset Sum (LRSS)

Instance: Finite $n$-vector $(a_1, \ldots, a_n)$ with $a_i \in \mathbb{Z}_0^+$, file size
$d \in \mathbb{Z}^+$, and subset size $r \in \mathbb{Z}^+$, where all $a_i$ and $d$ can be
written as decimal numbers of length at most $l$.

Question: Is there an $r$-subset $\mathbf{r} \subseteq \{1, \ldots, n\}$ that satisfies
$\sum_{i \in \mathbf{r}} a_i \ge d$?

Note that the allocation and file size have been scaled so
that the problem parameters are all integers. To show that the
counting problem #LRSS is #P-complete, we essentially apply
the proof of Proposition 1, substituting #LSS with #LRSS,
and stipulating that the subset size $r = m + k$ in the Turing
reduction. ■
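
To illustrate the gap between the decision problem and its counting version, here is a minimal sketch (an illustration only, with hypothetical function names). The decision version is answered by summing the $r$ largest entries, whereas the counting version is computed here by exhaustive enumeration, which takes time exponential in general.

```python
from itertools import combinations


def lrss_decision(a, d, r):
    """Decide LRSS: is there an r-subset of entries of `a` with sum >= d?

    It suffices to check the r largest entries, so sorting is enough.
    """
    return sum(sorted(a, reverse=True)[:r]) >= d


def lrss_count(a, d, r):
    """Count the r-subsets (by index) whose entries sum to at least d (brute force)."""
    return sum(1 for subset in combinations(a, r) if sum(subset) >= d)


if __name__ == "__main__":
    a, d, r = [3, 1, 2, 2], 5, 2
    print(lrss_decision(a, d, r), lrss_count(a, d, r))
```

Dividing the count by $\binom{n}{r}$ gives the recovery probability of the corresponding scaled allocation.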
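Proof of Lemma 4: Summing up the $c$ inequalities of
(18) produces

$$\sum_{j=1}^{c} \sum_{i \in \mathbf{r}_j} x_i \ge c.$$

The terms on the left-hand side can be regrouped to obtain

$$\sum_{i \in S} \sum_{j=1}^{c} \mathbf{1}\left[i \in \mathbf{r}_j\right] x_i \ge c.$$

Substituting (19) into the above inequality yields

$$b \sum_{i \in S} x_i \ge c,$$

as required. ■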

Proof of Lemma 5: Let $\mathcal{R}$ be the collection of all $\binom{n}{r}$
possible $r$-subsets of $\{1, \ldots, n\}$. If $P_S = 1$, then any feasible
allocation must satisfy

$$\sum_{i \in \mathbf{r}} x_i \ge 1 \quad \forall\ \mathbf{r} \in \mathcal{R}.$$

Observe that each element in $\{1, \ldots, n\}$ appears the same
number of times among the $r$-subsets in $\mathcal{R}$. Specifically,
the number of $r$-subsets that contain element $i \in \{1, \ldots, n\}$
is just the number of ways of choosing the other $(r-1)$
elements of the $r$-subset from the remaining $(n-1)$ elements
of $\{1, \ldots, n\}$, i.e.,

$$\sum_{\mathbf{r} \in \mathcal{R}} \mathbf{1}\left[i \in \mathbf{r}\right] = \binom{n-1}{r-1} \quad \forall\ i \in \{1, \ldots, n\}.$$

Applying Lemma 4 with $S = \{1, \ldots, n\}$, $c = \binom{n}{r}$, and
$b = \binom{n-1}{r-1}$ therefore produces

$$\sum_{i=1}^{n} x_i \ge \frac{\binom{n}{r}}{\binom{n-1}{r-1}} = \frac{n}{r}$$

for any feasible allocation. Now, $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is a feasible
allocation since it has a recovery probability of exactly 1;
because it uses the minimum possible total amount of storage
$\frac{n}{r}$, this allocation is also optimal. ■
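
For small instances, the recovery probability under fixed-size subset access can be computed by direct enumeration. The sketch below is an illustration under assumed names and parameters ($n = 6$, $r = 3$); it confirms that the uniform allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ attains a recovery probability of 1 with total storage $\frac{n}{r}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def recovery_probability(x, r):
    """Fraction of r-subsets of nodes whose stored amounts sum to at least 1."""
    n = len(x)
    successes = sum(
        1 for subset in combinations(range(n), r) if sum(x[i] for i in subset) >= 1
    )
    return Fraction(successes, comb(n, r))


if __name__ == "__main__":
    n, r = 6, 3
    uniform = [Fraction(1, r)] * n
    print(recovery_probability(uniform, r), sum(uniform))
```

Using `Fraction` avoids rounding errors when checking whether a subset stores exactly one unit of data.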
Proof of Theorem 5: Suppose that $n$ is a multiple of $r$;
let positive integer $a$ be defined such that $n = ar$.

We will first prove that $P_S > 1 - \frac{r}{n}$ is a sufficient condition
for the optimality of $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ by showing that if the
constraint

$$\sum_{i \in \mathbf{r}} x_i \ge 1 \tag{35}$$

is satisfied for more than $\left(1 - \frac{r}{n}\right)\binom{n}{r}$ distinct $r$-subsets
$\mathbf{r} \subseteq \{1, \ldots, n\}$, then the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ minimizes the
required budget $T$. Our approach is motivated by the observation
of Lemma 4. We begin by constructing a collection
of $r$-subsets such that if constraint (35) is satisfied
for the $r$-subsets in this collection, then $\sum_{i=1}^{n} x_i \ge \frac{n}{r}$. We
then demonstrate that such a collection of $r$-subsets can be
found among any collection of more than $\left(1 - \frac{r}{n}\right)\binom{n}{r}$ distinct
$r$-subsets.

Let

$$Q = (\mathbf{v}_1, \ldots, \mathbf{v}_a)$$

be an ordered partition of $\{1, \ldots, n\}$ that comprises $a$ parts,
where $|\mathbf{v}_j| = r$, $j = 1, \ldots, a$. For a given ordered partition $Q$,
we specify a collection of $a$ distinct $r$-subsets

$$\mathcal{R}_Q = \{\mathbf{r}_1, \ldots, \mathbf{r}_a\},$$

where $\mathbf{r}_j = \mathbf{v}_j$, $j = 1, \ldots, a$.

Fig. 9 provides an example of how $Q$ and $\mathcal{R}_Q$ are constructed.
Let $A$ be the total number of possible ordered partitions $Q$.
By counting the number of ways of picking $\mathbf{v}_j$, we have

$$A = \binom{ar}{r} \binom{(a-1)r}{r} \binom{(a-2)r}{r} \cdots \binom{r}{r} = \frac{(ar)!}{(r!)^a}.$$
Let $(n, r) = (8, 4)$.
Writing $n = ar$ gives $a = 2$.

An example of an ordered partition is

$Q = (\{1,2,3,4\}, \{5,6,7,8\})$.

Its corresponding collection of $r$-subsets is
$\mathcal{R}_Q = \{\{1,2,3,4\}, \{5,6,7,8\}\}$.

Fig. 9. Example for the construction of the ordered partition $Q$ and its
corresponding collection of $r$-subsets $\mathcal{R}_Q$, in the proof of Theorem 5 (when
$n$ is a multiple of $r$).



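Let $B$ be the number of ordered partitions $Q$ for which $\mathbf{r} \in \mathcal{R}_Q$,
for a given $r$-subset $\mathbf{r} \subseteq \{1, \ldots, n\}$. By counting the number
of ways of picking $\mathbf{v}_j$, subject to the requirement that $\mathbf{r} \in \mathcal{R}_Q$,
we have

$$B = a \underbrace{\binom{(a-1)r}{r} \binom{(a-2)r}{r} \cdots \binom{r}{r}}_{(a-1)\text{ terms}} = \frac{a\,((a-1)r)!}{(r!)^{a-1}}.$$

We claim that for any given ordered partition $Q$, if

$$\sum_{i \in \mathbf{r}} x_i \ge 1 \quad \forall\ \mathbf{r} \in \mathcal{R}_Q,$$

then $\sum_{i=1}^{n} x_i \ge \frac{n}{r}$. To see this, observe that each element
$i \in \{1, \ldots, n\}$ appears in exactly one of the $a$ $r$-subsets of
$\mathcal{R}_Q$, i.e.,

$$\sum_{\mathbf{r} \in \mathcal{R}_Q} \mathbf{1}\left[i \in \mathbf{r}\right] = 1 \quad \forall\ i \in \{1, \ldots, n\}.$$

Applying Lemma 4 with $S = \{1, \ldots, n\}$, $c = a$, and $b = 1$
therefore produces $\sum_{i=1}^{n} x_i \ge \frac{a}{1} = \frac{n}{r}$.

Let $\mathcal{R}$ be the collection of all $\binom{n}{r}$ possible $r$-subsets of
$\{1, \ldots, n\}$. Observe that all $A$ collections $\mathcal{R}_Q$ can be found in
$\mathcal{R}$, i.e.,

$$\mathcal{R}_Q \subseteq \mathcal{R} \quad \forall\ Q.$$

With each removal of an $r$-subset from $\mathcal{R}$, we reduce the
number of collections $\mathcal{R}_Q$ that can be found among the
remaining $r$-subsets by at most $B$. It follows that the minimum
number of $r$-subsets that need to be removed from $\mathcal{R}$ so that
no collections $\mathcal{R}_Q$ remain is at least $\left\lceil \frac{A}{B} \right\rceil$, where

$$\frac{A}{B} = \frac{(ar)!}{a\, r!\, ((a-1)r)!} = \frac{1}{a} \binom{n}{r} = \frac{r}{n} \binom{n}{r}.$$

Thus, if fewer than $\frac{A}{B} = \frac{r}{n}\binom{n}{r}$ $r$-subsets are removed from $\mathcal{R}$,
then at least one collection $\mathcal{R}_Q$ would remain; equivalently,
some collection $\mathcal{R}_Q$ can be found among any collection of
more than $\left(1 - \frac{r}{n}\right)\binom{n}{r}$ distinct $r$-subsets.

We have therefore shown that if $P_S > 1 - \frac{r}{n}$, then any feasible
allocation must satisfy $\sum_{i=1}^{n} x_i \ge \frac{n}{r}$. Now, $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$
is a feasible allocation since it has a recovery probability of
exactly 1; because it uses the minimum possible total amount
of storage $\frac{n}{r}$, this allocation is also optimal.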



We proceed to prove that $P_S > 1 - \frac{r}{n}$ is also a necessary
condition for the optimality of $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ by demonstrating
that this allocation is suboptimal for any $P_S \le 1 - \frac{r}{n}$.

For $r < n$, the allocation $\left(0, \frac{1}{r}, \ldots, \frac{1}{r}\right)$ has a recovery
probability of $\binom{n-1}{r} / \binom{n}{r} = 1 - \frac{r}{n}$ and is therefore a feasible
allocation for any $P_S \le 1 - \frac{r}{n}$. Since this allocation uses a
smaller total amount of storage $\frac{n-1}{r} < \frac{n}{r}$, it is a strictly better
allocation than $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ for any $P_S \le 1 - \frac{r}{n}$.

For the trivial case $r = n$, we have $1 - \frac{r}{n} = 0$. The empty
allocation $(0, \ldots, 0)$ is clearly optimal for any $P_S \le 0$. ■
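
The counting argument can be checked by brute force on a small instance. The sketch below is an illustration with assumed names and parameters ($n = 6$, $r = 2$, so $a = 3$); it enumerates every ordered partition of $\{1, \ldots, n\}$ into parts of size $r$, counts $A$ and $B$ directly, and compares the ratio with $\frac{r}{n}\binom{n}{r}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def ordered_partitions(elements, r):
    """Yield every ordered partition of `elements` into parts of size r."""
    elements = frozenset(elements)
    if not elements:
        yield ()
        return
    for part in combinations(sorted(elements), r):
        for rest in ordered_partitions(elements - frozenset(part), r):
            yield (frozenset(part),) + rest


def count_A_and_B(n, r, target):
    """Count all ordered partitions (A) and those having `target` as a part (B)."""
    target = frozenset(target)
    A = B = 0
    for Q in ordered_partitions(range(1, n + 1), r):
        A += 1
        B += target in Q  # True counts as 1
    return A, B


if __name__ == "__main__":
    n, r = 6, 2
    A, B = count_A_and_B(n, r, {1, 2})
    print(A, B, Fraction(A, B), Fraction(r, n) * comb(n, r))
```

The enumeration grows factorially with $n$, so this check is only practical for very small instances.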
Proof of Theorem 6: Suppose that $n$ is not a multiple
of $r$; let integers $a$ and $r'$ be as defined in the theorem. For
brevity, we additionally define positive integers $d$, $m$, and $m'$
such that

$$d = \gcd(r, r'), \quad r = md, \quad r' = m'd.$$

We can therefore write $n = (am + m')d$.
We will prove that

$$P_S > 1 - \frac{d}{ad + m'd} = 1 - \frac{1}{a + m'}$$

is a sufficient condition for the optimality of $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ by
showing that if the constraint

$$\sum_{i \in \mathbf{r}} x_i \ge 1$$

is satisfied for more than $\left(1 - \frac{1}{a + m'}\right)\binom{n}{r}$ distinct $r$-subsets
$\mathbf{r} \subseteq \{1, \ldots, n\}$, then the allocation $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ minimizes the
required budget $T$. We apply the proof technique of Theorem 5
but modify the construction of the ordered partition $Q$
and its corresponding collection of $r$-subsets $\mathcal{R}_Q$ to take into
account the indivisibility of $n$ by $r$.

For the moment, we will proceed with the assumption that
$a \ge 1$. Let

$$Q = (\mathbf{u}_1, \ldots, \mathbf{u}_{m'}, \mathbf{v}_1, \ldots, \mathbf{v}_a)$$

be an ordered partition of $\{1, \ldots, n\}$ that comprises $(m' + a)$
parts, where

$$|\mathbf{u}_j| = d, \quad j = 1, \ldots, m',$$
$$|\mathbf{v}_j| = r = md, \quad j = 1, \ldots, a.$$

For a given ordered partition $Q$, we specify a collection of
$(m' + a)$ distinct $r$-subsets

$$\mathcal{R}_Q = \{\mathbf{r}_1, \ldots, \mathbf{r}_{m'}, \mathbf{r}_{m'+1}, \ldots, \mathbf{r}_{m'+a}\},$$

$$\text{where } \mathbf{r}_j = \begin{cases} \bigcup_{\ell=0}^{m-1} \mathbf{u}_{j+\ell} & \text{if } j = 1, \ldots, m', \\ \mathbf{v}_{j-m'} & \text{if } j = m'+1, \ldots, m'+a, \end{cases}$$

and $\mathbf{u}_j \triangleq \mathbf{u}_{j-m'}$ if $j > m'$.

Fig. 10 provides an example of how $Q$ and $\mathcal{R}_Q$ are constructed.
Let $A$ be the total number of possible ordered partitions $Q$.
By counting the number of ways of picking $\mathbf{u}_j$ and $\mathbf{v}_j$, we
have

$$A = \underbrace{\binom{(am+m')d}{d} \binom{(am+m'-1)d}{d} \cdots \binom{(am+1)d}{d}}_{m'\text{ terms}} \ \underbrace{\binom{amd}{md} \binom{(a-1)md}{md} \cdots \binom{md}{md}}_{a\text{ terms}} = \frac{((am+m')d)!}{(d!)^{m'} ((md)!)^{a}}.$$

Let $(n, r) = (10, 4)$.

Writing $n = ar + r'$ gives $a = 1$ and $r' = 6$.

We have $d = \gcd(r, r') = 2$, $m = r/d = 2$, and $m' = r'/d = 3$.

An example of an ordered partition is

$Q = (\{1,2\}, \{3,4\}, \{5,6\}, \{7,8,9,10\})$.

Its corresponding collection of $r$-subsets is
$\mathcal{R}_Q = \{\{1,2,3,4\}, \{3,4,5,6\}, \{5,6,1,2\}, \{7,8,9,10\}\}$.

Fig. 10. Example for the construction of the ordered partition $Q$ and its
corresponding collection of $r$-subsets $\mathcal{R}_Q$, in the proof of Theorem 6 (when
$n$ is not a multiple of $r$).

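Let $B$ be the number of ordered partitions $Q$ for which $\mathbf{r} \in \mathcal{R}_Q$,
for a given $r$-subset $\mathbf{r} \subseteq \{1, \ldots, n\}$. By counting the number
of ways of picking $\mathbf{u}_j$ and $\mathbf{v}_j$, subject to the requirement that
$\mathbf{r} \in \mathcal{R}_Q$, we have

$$\begin{aligned}
B ={}& a\, \underbrace{\binom{((a-1)m+m')d}{d} \binom{((a-1)m+m'-1)d}{d} \cdots \binom{((a-1)m+1)d}{d}}_{m'\text{ terms}} \ \underbrace{\binom{(a-1)md}{md} \binom{(a-2)md}{md} \cdots \binom{md}{md}}_{(a-1)\text{ terms}} \\
&+ m'\, \underbrace{\binom{md}{d} \binom{(m-1)d}{d} \cdots \binom{d}{d}}_{m\text{ terms}} \ \underbrace{\binom{((a-1)m+m')d}{d} \binom{((a-1)m+m'-1)d}{d} \cdots \binom{(am+1)d}{d}}_{(m'-m)\text{ terms}} \ \underbrace{\binom{amd}{md} \binom{(a-1)md}{md} \cdots \binom{md}{md}}_{a\text{ terms}} \\
={}& a\, \frac{(((a-1)m+m')d)!}{(d!)^{m'} ((md)!)^{a-1}} + m'\, \frac{(((a-1)m+m')d)!}{(d!)^{m'} ((md)!)^{a-1}} \\
={}& (a + m')\, \frac{(((a-1)m+m')d)!}{(d!)^{m'} ((md)!)^{a-1}}.
\end{aligned}$$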

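We claim that for any given ordered partition $Q$, if

$$\sum_{i \in \mathbf{r}} x_i \ge 1 \quad \forall\ \mathbf{r} \in \mathcal{R}_Q,$$

then $\sum_{i=1}^{n} x_i \ge \frac{n}{r}$. To see this, consider the partition of
$\{1, \ldots, n\}$ formed by sets $U$ and $V$, where

$$U \triangleq \bigcup_{j=1}^{m'} \mathbf{u}_j, \quad V \triangleq \bigcup_{j=1}^{a} \mathbf{v}_j.$$

Correspondingly, we partition $\mathcal{R}_Q$ into two collections of
$r$-subsets $\mathcal{R}_Q^U$ and $\mathcal{R}_Q^V$, where

$$\mathcal{R}_Q^U \triangleq \{\mathbf{r}_1, \ldots, \mathbf{r}_{m'}\}, \quad \mathcal{R}_Q^V \triangleq \{\mathbf{r}_{m'+1}, \ldots, \mathbf{r}_{m'+a}\}.$$

Observe that each element $i \in U$ appears in exactly one $\mathbf{u}_\ell$,
which in turn appears in exactly $m$ of the $m'$ $r$-subsets of $\mathcal{R}_Q^U$
(namely $\mathbf{r}_\ell, \mathbf{r}_{\ell-1}, \ldots, \mathbf{r}_{\ell-(m-1)}$, where $\mathbf{r}_\ell \triangleq \mathbf{r}_{\ell+m'}$ if $\ell < 1$),
i.e.,

$$\sum_{\mathbf{r} \in \mathcal{R}_Q^U} \mathbf{1}\left[i \in \mathbf{r}\right] = m \quad \forall\ i \in U.$$

Applying Lemma 4 with $S = U$, $c = m'$, and $b = m$ therefore
produces $\sum_{i \in U} x_i \ge \frac{m'}{m} = \frac{r'}{r}$. Likewise, observe that each
element $i \in V$ appears in exactly one of the $a$ $r$-subsets of
$\mathcal{R}_Q^V$, i.e.,

$$\sum_{\mathbf{r} \in \mathcal{R}_Q^V} \mathbf{1}\left[i \in \mathbf{r}\right] = 1 \quad \forall\ i \in V.$$

Applying Lemma 4 with $S = V$, $c = a$, and $b = 1$ therefore
produces $\sum_{i \in V} x_i \ge a$. Combining the sums of $U$ and $V$
yields

$$\sum_{i=1}^{n} x_i = \sum_{i \in U} x_i + \sum_{i \in V} x_i \ge \frac{r'}{r} + a = \frac{n}{r}.$$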


Let $\mathcal{R}$ be the collection of all $\binom{n}{r}$ possible $r$-subsets of
$\{1, \ldots, n\}$. As demonstrated in the proof of Theorem 5, if
fewer than $\frac{A}{B}$ $r$-subsets are removed from $\mathcal{R}$, then at least one
collection $\mathcal{R}_Q$ can be found among the remaining $r$-subsets.
In this case, we have

$$\frac{A}{B} = \frac{1}{a + m'} \cdot \frac{((am+m')d)!}{(((a-1)m+m')d)!\,(md)!} = \frac{1}{a + m'} \binom{n}{r}.$$

Thus, some collection $\mathcal{R}_Q$ can be found among any collection
of more than $\left(1 - \frac{1}{a + m'}\right)\binom{n}{r}$ distinct $r$-subsets.

We have therefore shown that if $P_S > 1 - \frac{1}{a + m'}$, then
any feasible allocation must satisfy $\sum_{i=1}^{n} x_i \ge \frac{n}{r}$. Now,
$\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is a feasible allocation since it has a recovery
probability of exactly 1; because it uses the minimum possible
total amount of storage $\frac{n}{r}$, this allocation is also optimal.

Applying the preceding argument to the degenerate case
of $a = 0$ produces $\frac{A}{B} = \frac{1}{m'}\binom{n}{r}$, which is consistent with the
above expression. ■
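
The modified construction can be reproduced programmatically. The sketch below is an illustration only (the function name and data layout are assumptions); it builds the collection $\mathcal{R}_Q$ from the parts of $Q$ for the example of Fig. 10 and checks the coverage counts that are used when applying Lemma 4.

```python
from collections import Counter
from fractions import Fraction


def build_subsets(us, vs, m):
    """Build R_Q from the parts u_1..u_m' (each of size d) and v_1..v_a (each of size r)."""
    m_prime = len(us)
    # r_j is the union of m consecutive u-parts, with indices wrapping around.
    wrapped = [
        frozenset().union(*(us[(j + l) % m_prime] for l in range(m)))
        for j in range(m_prime)
    ]
    return wrapped + [frozenset(v) for v in vs]


if __name__ == "__main__":
    # Example of Fig. 10: (n, r) = (10, 4), a = 1, d = 2, m = 2, m' = 3.
    us = [{1, 2}, {3, 4}, {5, 6}]
    vs = [{7, 8, 9, 10}]
    subsets = build_subsets(us, vs, m=2)
    coverage = Counter(i for s in subsets for i in s)
    print([sorted(s) for s in subsets])
    print(dict(coverage))
    # Lower bound on total storage implied by Lemma 4: m'/m + a.
    print(Fraction(len(us), 2) + len(vs))
```

Each element of $U$ is covered $m$ times and each element of $V$ once, which is what yields the bound $\frac{m'}{m} + a = \frac{n}{r}$.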
Proof of Corollary 3: Suppose that $n$ is a multiple of
$(n - r)$; let integer $\beta \ge 2$ be defined such that $n = \beta(n - r)$.

If $\beta = 2$, then $n = 2r$, i.e., $n$ is a multiple of $r$. According
to Theorem 5, $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation if and only
if

$$P_S > 1 - \frac{r}{n} = 1 - \frac{r}{2r} = \frac{1}{2} = \frac{r}{n},$$

as required.

If $\beta \ge 3$, then $n$ is not a multiple of $r$. We can write
$n = ar + r'$, where $a = 0$ and $r' = n \in \{r+1, \ldots, 2r-1\}$.
According to Theorem 6, $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ is an optimal allocation
if

$$P_S > 1 - \frac{\gcd(r, r')}{a \gcd(r, r') + r'} = 1 - \frac{\gcd(r, n)}{n} = \frac{r}{n}.$$

To show that $P_S > \frac{r}{n}$ is also a necessary condition for the
optimality of $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$, we demonstrate that this allocation
is suboptimal for any $P_S \le \frac{r}{n}$. The allocation $(1, 0, \ldots, 0)$ has
a recovery probability of $\binom{n-1}{r-1} / \binom{n}{r} = \frac{r}{n}$ and is therefore a
feasible allocation for any $P_S \le \frac{r}{n}$. Since this allocation uses
a smaller total amount of storage $1 < \frac{n}{r}$, it is a strictly better
allocation than $\left(\frac{1}{r}, \ldots, \frac{1}{r}\right)$ for any $P_S \le \frac{r}{n}$. ■
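Proof of Lemma 6: At $T = \frac{n}{r}$, the recovery probability
corresponding to a particular choice of $\ell \in \{1, 2, \ldots, r-1\}$
is given by

$$P_S\left(n, r, T{=}\tfrac{n}{r}, \ell\right) = \mathbb{P}\left[\mathrm{B}\left(r, \tfrac{\ell}{r}\right) \ge \ell\right].$$

We will prove that the above expression is at most $\frac{3}{4}$ for any
$\ell \in \{1, 2, \ldots, r-1\}$ and $r \ge 2$ by showing that

$$\mathbb{P}\left[\mathrm{B}\left(a+b, \tfrac{a}{a+b}\right) \ge a\right] \le \frac{3}{4}$$

for any positive integers $a$ and $b$. To do this, we consider the
following three exhaustive cases separately: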

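Case 1: Suppose that $a \ge 18$ and $b \ge 3$. We will first
derive an upper bound for $\mathbb{P}\left[\mathrm{B}\left(a+b, \frac{a}{a+b}\right) \ge a\right]$ by
finding separate bounds for $\mathbb{P}\left[\mathrm{B}\left(a+b, \frac{a}{a+b}\right) = a\right]$ and
$\mathbb{P}\left[\mathrm{B}\left(a+b, \frac{a}{a+b}\right) \ge a+1\right]$; we then proceed to show that
this upper bound is smaller than $\frac{3}{4}$ for any $a \ge 18$ and $b \ge 3$.
For any positive integers $a$ and $b$, we have

$$\mathbb{P}\left[\mathrm{B}\left(a+b, \tfrac{a}{a+b}\right) = a\right] = \binom{a+b}{a} \left(\frac{a}{a+b}\right)^{a} \left(\frac{b}{a+b}\right)^{b} < \frac{e^{\frac{1}{12(a+b)}}}{\sqrt{2\pi}} \sqrt{\frac{a+b}{ab}}. \tag{36}$$

Inequality (36) follows from the application of the following
bound for the binomial coefficient:

$$\binom{a+b}{a} < \frac{e^{\frac{1}{12(a+b)}}\, (a+b)^{a+b+\frac{1}{2}}}{\sqrt{2\pi}\; a^{a+\frac{1}{2}}\, b^{b+\frac{1}{2}}},$$

which is derived from the following Stirling-based bounds for
the factorial (see, e.g., [31]):

$$\sqrt{2\pi k} \left(\frac{k}{e}\right)^{k} < k! < \sqrt{2\pi k} \left(\frac{k}{e}\right)^{k} e^{\frac{1}{12k}}, \quad k \ge 1.$$

For any positive integers $a$ and $b$, we have

$$\mathbb{P}\left[\mathrm{B}\left(a+b, \tfrac{a}{a+b}\right) \ge a+1\right] \le \frac{1}{2}, \tag{37}$$

which follows from the definition of the median: The mean of
the binomial random variable $\mathrm{B}\left(a+b, \frac{a}{a+b}\right)$ is $(a+b) \cdot \frac{a}{a+b}$
$= a$; since the mean is an integer, the median coincides with
the mean [32]. Therefore, according to the definition of the
median, we have

$$\mathbb{P}\left[\mathrm{B}\left(a+b, \tfrac{a}{a+b}\right) \le a\right] \ge \frac{1}{2},$$

which leads to inequality (37).

Combining bounds (36) and (37) produces

$$\mathbb{P}\left[\mathrm{B}\left(a+b, \tfrac{a}{a+b}\right) \ge a\right] < \frac{e^{\frac{1}{12(a+b)}}}{\sqrt{2\pi}} \sqrt{\frac{1}{a} + \frac{1}{b}} + \frac{1}{2} \triangleq f(a, b)$$

for any positive integers $a$ and $b$. Now, the upper bound $f(a, b)$
is a decreasing function of both $a$ and $b$ since $f(a, b)$ is a
symmetric function and the partial derivative

$$\frac{\partial}{\partial a} f(a, b) = -\frac{6b^2 + 6ab + a}{12a(a+b)^2} \cdot \frac{e^{\frac{1}{12(a+b)}}}{\sqrt{2\pi}} \sqrt{\frac{a+b}{ab}}$$

is negative for any $a \ge 1$ and $b \ge 1$. Thus, for any $a \ge 18$ and
$b \ge 3$, we have

$$f(a, b) \le f(a{=}18, b{=}3) \approx 0.749773 < \frac{3}{4},$$

which implies that $\mathbb{P}\left[\mathrm{B}\left(a+b, \frac{a}{a+b}\right) \ge a\right] < \frac{3}{4}$ for any positive
integers $a \ge 18$ and $b \ge 3$.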

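Case 2: Suppose that $b \in \{1, 2\}$. We will show that

$$\mathbb{P}\left[\mathrm{B}\left(a+1, \tfrac{a}{a+1}\right) \ge a\right] \le \frac{3}{4} \quad \text{and} \quad \mathbb{P}\left[\mathrm{B}\left(a+2, \tfrac{a}{a+2}\right) \ge a\right] \le \frac{3}{4}$$

for any positive integer $a$. The left-hand side of each inequality
can be expanded and simplified to obtain the following:

$$\mathbb{P}\left[\mathrm{B}\left(a+1, \tfrac{a}{a+1}\right) \ge a\right] = \frac{a^a (2a+1)}{(a+1)^{a+1}} \triangleq f_1(a),$$

$$\mathbb{P}\left[\mathrm{B}\left(a+2, \tfrac{a}{a+2}\right) \ge a\right] = \frac{a^a (5a^2 + 10a + 4)}{(a+2)^{a+2}} \triangleq f_2(a).$$

The first derivatives of $f_1(a)$ and $f_2(a)$, which are given by

$$f_1'(a) = \frac{a^a}{(a+1)^{a+1}} \left\{ 2 - (2a+1) \ln\left(\frac{a+1}{a}\right) \right\},$$

$$f_2'(a) = \frac{a^a}{(a+2)^{a+2}} \left\{ (10a + 10) - (5a^2 + 10a + 4) \ln\left(\frac{a+2}{a}\right) \right\},$$

can be shown to be negative for any $a \ge 1$. Since
$f_1(a{=}1) = \frac{3}{4}$, $f_2(a{=}1) = \frac{19}{27} < \frac{3}{4}$, and both $f_1(a)$ and $f_2(a)$
are decreasing functions of $a$ for any $a \ge 1$, it follows that
$f_1(a) \le \frac{3}{4}$ and $f_2(a) \le \frac{3}{4}$ for any positive integer $a$, as required.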

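Case 3: Suppose that $a \in \{1, 2, \ldots, 17\}$. We will describe
our approach for $a = 1$ and $a = 2$; the proofs for the other
15 cases are similar, and can be verified with the help of a
computer. We will show that

$$\mathbb{P}\left[\mathrm{B}\left(b+1, \tfrac{1}{b+1}\right) \ge 1\right] \le \frac{3}{4} \quad \text{and} \quad \mathbb{P}\left[\mathrm{B}\left(b+2, \tfrac{2}{b+2}\right) \ge 2\right] \le \frac{3}{4}$$

for any positive integer $b$. The left-hand side of each inequality
can be expanded and simplified to obtain the following:

$$\mathbb{P}\left[\mathrm{B}\left(b+1, \tfrac{1}{b+1}\right) \ge 1\right] = 1 - \frac{b^{b+1}}{(b+1)^{b+1}} \triangleq g_1(b),$$

$$\mathbb{P}\left[\mathrm{B}\left(b+2, \tfrac{2}{b+2}\right) \ge 2\right] = 1 - \frac{b^{b+1}(3b+4)}{(b+2)^{b+2}} \triangleq g_2(b).$$

The first derivatives of $g_1(b)$ and $g_2(b)$, which are given by

$$g_1'(b) = \frac{b^b}{(b+1)^{b+1}} \left\{ b \ln\left(\frac{b+1}{b}\right) - 1 \right\},$$

$$g_2'(b) = \frac{b^b}{(b+2)^{b+2}} \left\{ (3b^2 + 4b) \ln\left(\frac{b+2}{b}\right) - (6b + 4) \right\},$$

can be shown to be negative for any $b \ge 1$. Since
$g_1(b{=}1) = \frac{3}{4}$, $g_2(b{=}1) = \frac{20}{27} < \frac{3}{4}$, and both $g_1(b)$ and $g_2(b)$
are decreasing functions of $b$ for any $b \ge 1$, it follows that
$g_1(b) \le \frac{3}{4}$ and $g_2(b) \le \frac{3}{4}$ for any positive integer $b$, as required. ■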
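
Since part of Case 3 is left to computer verification, a numerical check of this kind might look like the following minimal sketch. It is an illustration, not the authors' verification: the names `tail` and `upper_bound` are assumptions, and the finite range of $b$ scanned is arbitrary, so the script spot-checks the bound rather than proving it for all $b$.

```python
from fractions import Fraction
from math import comb, exp, pi, sqrt


def tail(a, b):
    """Return P[B(a + b, a / (a + b)) >= a] exactly, as a Fraction."""
    n = a + b
    p = Fraction(a, n)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))


def upper_bound(a, b):
    """Evaluate the Case 1 bound f(a, b) in floating point."""
    return exp(1 / (12 * (a + b))) / sqrt(2 * pi) * sqrt(1 / a + 1 / b) + 0.5


if __name__ == "__main__":
    # Spot-check a in {1, ..., 17} over an arbitrary finite range of b.
    worst = max(tail(a, b) for a in range(1, 18) for b in range(1, 31))
    print(worst, worst <= Fraction(3, 4))
    print(upper_bound(18, 3))
```

The exact tail probabilities are rational numbers, so comparing them against $\frac{3}{4}$ involves no rounding.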
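Proof of Theorem 7: We have already established that the
choice of $\ell = r$ is optimal for any $T \ge \frac{n}{r}$; it therefore suffices
to show that $\ell = r$ is also optimal for any $T \in \left[\frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}, \frac{n}{r}\right)$.
The recovery probability corresponding to any
$\ell \in \{1, 2, \ldots, r\}$ is given by

$$P_S(n, r, T, \ell) = \mathbb{P}\left[\mathrm{B}\left(r, \min\left(\tfrac{\ell T}{n}, 1\right)\right) \ge \ell\right],$$

which is a nondecreasing function of $T$ since $\min\left(\frac{\ell T}{n}, 1\right)$
either increases or remains constant at 1 as $T$ increases. More
precisely, $P_S(n, r, T, \ell)$ is an increasing function of $T$ on the
interval $\left(0, \frac{n}{\ell}\right)$; for higher values of $T$, the function saturates
at 1. We can verify this claim by checking that the partial
derivative

$$\frac{\partial}{\partial p} \mathbb{P}\left[\mathrm{B}(r, p) \ge \ell\right] = \ell \binom{r}{\ell} p^{\ell-1} (1-p)^{r-\ell}$$

is positive for any $p \in (0, 1)$.

Now, the recovery probability corresponding to the choice
of $\ell = r$ at $T = \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}$ is given by

$$P_S\left(n, r, T{=}\tfrac{n}{r}\left(\tfrac{3}{4}\right)^{\frac{1}{r}}, \ell{=}r\right) = \mathbb{P}\left[\mathrm{B}\left(r, \left(\tfrac{3}{4}\right)^{\frac{1}{r}}\right) \ge r\right] = \frac{3}{4}.$$

Since $P_S(n, r, T, \ell)$ is a nondecreasing function of $T$, we have

$$P_S(n, r, T, \ell{=}r) \ge \frac{3}{4} \quad \text{for any } T \ge \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}.$$

On the other hand, for any $\ell \in \{1, 2, \ldots, r-1\}$, we have

$$P_S(n, r, T, \ell) \le \frac{3}{4} \quad \text{for any } T \le \frac{n}{r},$$

from the upper bound of Lemma 6. It therefore follows that
the choice of $\ell = r$ is optimal for any $T \in \left[\frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}, \frac{n}{r}\right)$, as
required. ■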
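
The threshold behavior can be illustrated numerically. The sketch below is an illustration under stated assumptions: it takes the recovery probability to be $\mathbb{P}\left[\mathrm{B}\left(r, \min\left(\frac{\ell T}{n}, 1\right)\right) \ge \ell\right]$ as used in this proof, the function name and the parameters $(n, r) = (12, 4)$ are hypothetical, and floating-point arithmetic is used.

```python
from math import comb


def recovery_probability(n, r, T, l):
    """Return P[B(r, min(l * T / n, 1)) >= l] in floating point."""
    p = min(l * T / n, 1.0)
    return sum(comb(r, k) * p**k * (1 - p) ** (r - k) for k in range(l, r + 1))


if __name__ == "__main__":
    n, r = 12, 4
    T_star = (n / r) * 0.75 ** (1 / r)  # threshold budget for l = r
    print(T_star, recovery_probability(n, r, T_star, r))
    # Compare all choices of l at a budget between the threshold and n / r.
    T = 2.9
    print({l: recovery_probability(n, r, T, l) for l in range(1, r + 1)})
```

At the threshold budget the choice $\ell = r$ attains a recovery probability of $\frac{3}{4}$ up to rounding, while the other choices of $\ell$ stay at or below $\frac{3}{4}$ for budgets up to $\frac{n}{r}$.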

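Proof of Corollary 5: Theorem 7 already demonstrates
that the choice of $\ell = r$ is optimal for any $T \ge \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}$; we
will proceed to show that a recovery probability of at least $\frac{3}{4}$
is not achievable for any $T < \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}$.

Recall from the proof of Theorem 7 that the recovery probability
$P_S(n, r, T, \ell)$ corresponding to any $\ell \in \{1, 2, \ldots, r\}$
is an increasing function of $T$ on the interval $\left(0, \frac{n}{\ell}\right)$. Thus,
for the choice of $\ell = r$, the function $P_S(n, r, T, \ell{=}r)$ increases
wrt $T$ on the subinterval $\left(0, \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}\right] \subset \left(0, \frac{n}{r}\right)$; since
$P_S\left(n, r, T{=}\frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}, \ell{=}r\right) = \frac{3}{4}$, it follows that

$$P_S(n, r, T, \ell{=}r) < \frac{3}{4} \quad \text{for any } T < \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}.$$

On the other hand, for any $\ell \in \{1, 2, \ldots, r-1\}$, the
function $P_S(n, r, T, \ell)$ increases wrt $T$ on the subinterval
$\left(0, \frac{n}{r}\right] \subset \left(0, \frac{n}{\ell}\right)$; since $P_S\left(n, r, T{=}\frac{n}{r}, \ell\right) \le \frac{3}{4}$ according to
Lemma 6, it follows that

$$P_S(n, r, T, \ell) < \frac{3}{4} \quad \text{for any } T < \frac{n}{r}.$$

Hence, the optimal recovery probability for any $T < \frac{n}{r}\left(\frac{3}{4}\right)^{\frac{1}{r}}$
is strictly less than $\frac{3}{4}$. ■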